Preliminaries

Having set up a working CoreOS cluster, the intrepid, viz., naive might proceed with installing applications on or even admitting users to their clustered systems.

Having seen developers and users alike wreak havoc on the most beautiful systems, I strongly advise against such folly. However, if you are bribed, coerced or otherwise motivated to do so, it is indeed possible. Employing a combination of docker and systemd, it is rather easy, I daresay.

A word of warning: this post is very long. Deploying applications in CoreOS requires sensible packaging, scheduling and service auto-discovery; HA setups and fine-grained machine selection are important in practice; and I still do not want to be slack on theory. I did not manage to be shorter.

Application Containerization

With the ever increasing power of computer hardware, running applications in virtualized environments has become the most common workload. Typically, a physical machine is then partitioned into several virtual systems by a piece of software called a hypervisor.

Said hypervisor can then delegate each partition to a separate operating system, which in turn provides kernel and system libraries for application software.

Packaging Appliances

Such application software in turn depends on application and system libraries and an operating system kernel. Integrating such software into a given set of application and system library dependencies, each in different versions, soon became a decidedly non-trivial task (quod googlet: dependency hell).

It was quickly observed that deploying application software in separate virtualized containers eliminates these problems. Packaging then simplifies to serializing the application, all its dependencies and the operating system kernel into an archive format and trivially unpacking and executing this set on a target hypervisor platform.

Archives needed to be very large, though, as they had to include every dependency: kernel, libc, any libraries and the application. For cloud operations, which depend on rapidly setting up and destroying machines scaled to the current application load, that approach would not work.

Container Virtualization

The situation changes when eliminating the hypervisor from the software stack. This is possible when using a technique from the operating system domain named container virtualization. This technology is not really young anymore; it has been around since about the turn of the millennium, first in FreeBSD and later in Solaris.

(Interesting reads include Jails: Confining the omnipotent root, Building Systems to be Shared Securely and the Solaris Zones USENIX paper.)

Container virtualization revolves around the idea of running processes in “private namespaces” provided by the operating system kernel which the processes cannot “leave”. In the absence of a hypervisor and, accordingly, of hardware emulation, some prefer the term “encapsulation” to the term “virtualization”.

An application process thus encapsulated now maps to a set of application and library binaries. As long as these binaries have been compiled against compatible kernels and libcs, a packaged appliance can drop the kernel and system libraries and only needs to include those dependencies still missing.

Layered Packaging

Now, it can be observed that many applications have common dependencies and that the problems of the aforementioned dependency hell stem from a relatively small subset.

Then, in the presence of encapsulation, starting from the “most recent common ancestor”, it is possible to “layer” the package. The layer at the bottom would then include libc and those libraries which every package depends on. The next layers include packages that most, but not all, applications depend on, and so forth. This drastically reduces package size and increases hard disk and even memory sharing.

Reducing package size also reduces time required to migrate the package over the network, which is one requirement for cloud deployments.
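To put a number on the saving, here is a small shell sketch. It is my own illustration, not a docker tool: the function name layer_totals, the input format ("image layer size_mb" per line) and the sizes used below are all made up for the purpose. It sums the layer sizes once as if every image shipped its own copy and once with identical layers stored a single time.

```shell
# layer_totals: read lines of "image layer size_mb" on stdin and print
# "<unshared> <shared>": the total size if every image carries its own copy
# of each layer, and the total size if identical layers are stored once.
# Hypothetical helper for illustration only; real tooling tracks layers by hash.
layer_totals() {
    awk '
        { unshared += $3 }             # each image counted with all its layers
        !seen[$2]++ { shared += $3 }   # each distinct layer counted only once
        END { printf "%d %d\n", unshared, shared }
    '
}
```

Feeding it two hypothetical images which share an 80 MB base layer, e.g. printf '%s\n' 'nginx base 80' 'nginx nginx 20' 'redis base 80' 'redis redis 15' | layer_totals, prints 195 115: the shared base layer is paid for once instead of twice.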

Deploying applications in CoreOS

As an OS for the cloud, CoreOS requires applications to utilize resources efficiently. It also needs to be able to run diverse and possibly mutually incompatible applications on the same operating system platform. Process encapsulation and layered packaging turned out to be a crucial point.

Having to expensively install and configure applications is a deal-breaker in the cloud setting. So, not applications, but appliances are deployed in CoreOS installations. In addition, in a cloud setting, operators do not necessarily care whether applications have been installed. Instead, operators care whether services are available somewhere.

fleetd and CoreOS

Consequently, CoreOS does not even have a traditional package manager like the dpkg or rpm suites on Debian or Red Hat respectively. Instead, docker and fleetd can be thought of as the package managers for CoreOS cloud ops:

docker (or, more recently, rkt) is used to package and run applications, i.e. services. fleetd is the CoreOS service scheduler: it interfaces with the cluster via etcd and determines the placement of services on specific machines.

To do so, fleetd plugs into systemd, the new Linux service management facility: appliances are started and stopped as systemd units. systemd manifests (unit files) are stored in etcd, the same distributed key-value store CoreOS uses to store cluster information.

systemd units for CoreOS

Syntactically, there is no difference between local systemd unit files and units designated for cluster use:

[Unit]
Description=nginx reverse proxy and www server
After=network.target docker.service

[Service]
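# The leading '-' lets startup proceed if no stale container named nginx exists.
ExecStartPre=-/usr/bin/docker rm -vf nginx
# --name makes the container addressable by the rm calls before and after.
ExecStart=/usr/bin/docker run --name nginx nginx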

ExecStop=/usr/bin/docker rm -vf nginx

Using fleetctl, such a unit file can be published to, scheduled on and executed on a CoreOS cluster, either from a machine running fleetd itself or via an ssh tunnel:

cjr@lamport: $ fleetctl start nginx.service
cjr@lamport: $ fleetctl --ssh-username=core --tunnel <IP> start nginx.service

(fleetctl can be run locally on a cluster member machine or remotely via ssh-tunneling.)

Then, listing the services on the cluster

cjr@lamport: $ fleetctl list-unit-files
UNIT          HASH    DSTATE   STATE    TARGET
nginx.service afa1506 launched launched fe64b5d8.../<IP>

shows the service being scheduled to a machine in that cluster.

Calling fleetctl yields a list of further commands which are sufficiently documented elsewhere. Of these subcommands, it is important to briefly touch on the semantics of the sequence submit, load and start. “Submitting” means uploading the unit file into the cluster, “loading” means scheduling the unit on some machine, and “starting” means actually launching it there. The language may be confusing at first.
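To make the sequence tangible, here is a minimal sketch which walks a unit through the stages one by one instead of letting a single fleetctl start do everything. The function name deploy_unit and the FLEETCTL variable are my own inventions; only the submit, load and start subcommands come from fleetctl.

```shell
# Assumption: FLEETCTL names the fleetctl binary. Overriding it, for example
# with "echo", turns the function into a dry run that only prints the steps.
FLEETCTL=${FLEETCTL:-fleetctl}

# deploy_unit: hypothetical helper that performs the three stages explicitly
# and stops at the first stage that fails.
deploy_unit() {
    unit=$1
    "$FLEETCTL" submit "$unit" &&   # upload the unit file into the cluster
    "$FLEETCTL" load "$unit" &&     # schedule the unit onto some machine
    "$FLEETCTL" start "$unit"       # have systemd on that machine start it
}
```

Running FLEETCTL=echo before calling deploy_unit nginx.service merely prints the three steps, which is a cheap way to check what would be sent to the cluster.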

Service Auto-Discovery

Scheduling and running services cluster-wide is not very useful if the services cannot be used, and to be used they need to be found.

Registering Services

The pattern employed in such clusters is to register services at startup with some centrally accessible facility, which, in the case of CoreOS, is etcd.

[Service]
[...]
ExecStartPost=-/usr/bin/etcdctl mkdir /clusterstate
ExecStartPost=-/usr/bin/etcdctl mkdir /clusterstate/cruwe_de
ExecStartPost=/bin/bash -c 'etcdctl set /clusterstate/cruwe_de/%n \
                           "%H $(cat /etc/motd.d/info.conf \
                                  | grep Public \
                                  | cut -f 3 -d" "):8080"'
[...]

ExecStopPost=-/bin/bash -c 'etcdctl rm /clusterstate/cruwe_de/%n'
[...]

When registering, it is necessary to ensure the presence of the etcd directories: etcdctl mkdir ... fails if the directory is already present and creates it otherwise. The failure has no consequences, as the command is prefixed with a ‘-’ character.

The service is then registered (%n will expand to the unit’s name and %H to the host’s name in systemd). Where to find a publicly accessible IP is highly system dependent.
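The inline pipeline in the unit is hard to read and harder to test, so here is the same logic as a sketch of two shell functions. The names public_ip and registration_value are mine, and so is the assumed file layout: a line such as "Public IPv4: 203.0.113.7", where the address is the third space-separated field. Adapt the parsing to whatever your machines actually provide.

```shell
# public_ip: print the third space-separated field of the first line that
# contains "Public" in the given file, mirroring the grep/cut pipeline.
# Assumption: the file has a line shaped like "Public IPv4: <address>".
public_ip() {
    grep Public "$1" | head -n 1 | cut -d ' ' -f 3
}

# registration_value: build the "<host> <ip>:<port>" string that the unit
# stores in etcd under the unit name.
registration_value() {
    host=$1 info_file=$2 port=$3
    printf '%s %s:%s\n' "$host" "$(public_ip "$info_file")" "$port"
}
```

With such a helper installed on the machines, the ExecStartPost line could shrink to an etcdctl set whose value is the output of registration_value for %H, the info file and the port.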

Accessing Registered Services

Then, it is possible to provision a load-balancer or a reverse proxy at some very stable IP-endpoint. Note again the pattern to have valuable machines which need to be taken care of (IP-endpoints) and dispensable machines (services).

On IP-endpoints, etcd is watched and some load-balancing or reverse-proxying server is reconfigured on etcd changes. Most commonly, nginx or haproxy is used to turn machines into dedicated http-switches.

While, in a trivial setup, watching etcd and rewriting a config file might be accomplished using etcdctl, sed and some lines of ksh, dedicated software exists which couples etcd with templates.

Think of a haproxy configuration: It is then possible to template the config with a tool named confd:

[template]
src = "haproxy.cfg.tmpl"
dest = "/etc/haproxy/haproxy.cfg"
keys = [
        "/clusterstate/cruwe_de"
]
reload_cmd = "echo reloading && systemctl reload haproxy.service"

Having configured confd to watch a specific etcd path and to reload haproxy on changes, it is possible to generate haproxy.cfg from a template (confd uses Go’s text/template syntax):

listen appname 0.0.0.0:80
    [...]
    
    {{range getvs "/clusterstate/cruwe_de/*"}}
    server {{.}} check
    {{end}}
    

Having an instance of haproxy running under a stable IP and having that instance reconfigure dynamically using confd solves the problem of accessing dynamically scheduled services in a cluster.
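To see what the range loop in the template boils down to, the following shell sketch mimics it without confd. It is only an illustration: the function name render_servers is made up, and the input is assumed to be one registered value per line in the "<host> <ip>:<port>" form written at registration time.

```shell
# render_servers: read one registered value per line on stdin and emit the
# haproxy "server" line the template would generate for it.
# Empty lines are skipped so a trailing newline does not produce a bogus entry.
render_servers() {
    while IFS= read -r value; do
        if [ -n "$value" ]; then
            printf '    server %s check\n' "$value"
        fi
    done
    return 0
}
```

Piping the two values "web1 203.0.113.7:8080" and "web2 203.0.113.8:8080" (documentation addresses, not real hosts) into render_servers yields one server line per registered instance, which is exactly the part of haproxy.cfg that changes when services come and go.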

To have more than one process running inside one docker container, it is possible to set systemd’s /usr/sbin/init as the container’s command. An example, cobbled together but working, can be found on my github account.

Remember that running systemd inside docker containers requires giving up some isolation properties:

ExecStart=/usr/bin/docker run \
        --rm \
        --name tcp_lb_confd \
        --hostname tcp_lb_confd \
        -p 80:80 \
        --privileged \
        -t \
        -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
        <dockerimage>

Special Cases

fleet-specific options are detailed in the upstream documentation. It is nevertheless necessary to touch on some of them here.

High Availability

Declaring a unit file a “template” allows multiple instances of a service to run on the cluster’s machines. This is accomplished by appending an ‘@’ character to the name part of the unit file’s filename, as in nginx@.service. Instances are then started by inserting an instance identifier, for example a number, after the ‘@’, as in

cjr@lamport: $ fleetctl start nginx@1.service
cjr@lamport: $ fleetctl start nginx@2.service

which schedules and executes two instances of the same unit file.

systemd will take care of unit failures local to the machine by restarting the service. fleet will re-schedule the service to a different machine when a machine leaves the cluster, either by controlled termination of fleetd or by crashing.
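Typing one fleetctl start per instance gets old quickly. As a sketch, assuming a template unit <name>@.service has already been made known to the cluster, a small loop can start instances 1 to N. The function name start_instances and the FLEETCTL variable are my own; set FLEETCTL=echo for a dry run.

```shell
# Assumption: FLEETCTL names the fleetctl binary; "echo" gives a dry run.
FLEETCTL=${FLEETCTL:-fleetctl}

# start_instances: hypothetical helper that starts instances 1..count of the
# template unit <name>@.service and aborts on the first failure.
start_instances() {
    name=$1 count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        "$FLEETCTL" start "${name}@${i}.service" || return 1
        i=$((i + 1))
    done
}
```

Calling start_instances nginx 2 then issues the same two commands as shown above.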

Controlling Scheduling on Machines

Scheduling can be refined by appending an [X-Fleet] section to the unit file.

[X-Fleet]
MachineID=3c1dded4

pins a service to one specific machine.

[X-Fleet]
MachineOf=another.service

enables co-location.

[X-Fleet]
MachineMetadata=COMPUTE=yes

runs a service only on machines where a metadata key-value pair has been set accordingly.

Use

[X-Fleet]
Conflicts=another.service

to prevent co-location with another service and

[X-Fleet]
Global=true

to schedule and execute the service on all cluster machines.

The following section, for example, prevents two nginx HA instances from running alongside each other on the same machine (to guard against cluster machine failure). Machines designated as endpoints are excluded; the idea is to have dedicated IP and/or TCP switching machines running a load balancer instance.

[X-Fleet]
Conflicts=nginx*
MachineMetadata=ENTRYPOINT=FALSE

Complex Sequences

Irrespective of cluster or non-cluster operation, when using systemd to run docker containers unattended, startup chains might become a bit more complex:

Firstly, to run a docker container, it is necessary to have the corresponding image at hand. Secondly, when pulling images from non-public registries, it is necessary to provide authentication tokens.

In such cases, prepending the startup command with the steps necessary to log in to a private docker registry and to pull the image solves the issue:

[Unit]
[...]
[Service]
[...]
ExecStartPre=/usr/bin/docker login \
                             --username=<uname> \
                             --password=<cred> \
                             --email=<mailaddr> \
                             <URL>
ExecStartPre=/usr/bin/docker pull <image>
ExecStart=/usr/bin/docker run <...>
[...]

When aiming at minimal downtime in case of machine failures, one might consider holding the container images ready on every machine. Logging in and pulling might then be separated into a second service (of Type=oneshot, presumably) and scheduled globally.

With the primary service now dependent, a unit could be refactored as such:

[Unit]
After=image-pull.service
Requires=image-pull.service
[Service]
[...]
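What such an image-pull.service might look like is sketched below as a shell function which prints the unit file. This is an assumption-laden draft rather than a tested production unit: the function name write_image_pull_unit is made up, the login step for private registries is left out, and Type=oneshot with RemainAfterExit=yes is my guess at a sensible service type, so that the unit counts as active once the pull has finished.

```shell
# write_image_pull_unit: print a unit file that pulls <image> once on every
# cluster machine. Hypothetical helper; a private registry would need a
# docker login step as ExecStartPre in addition.
write_image_pull_unit() {
    image=$1
    cat <<EOF
[Unit]
Description=Pre-pull ${image}
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker pull ${image}

[X-Fleet]
Global=true
EOF
}
```

Redirecting the output of write_image_pull_unit nginx into image-pull.service gives a file which can be handed to fleetctl like any other unit.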