Provisioning Applications on CoreOS Clusters
Having set up a working CoreOS cluster, the intrepid, or rather naive, reader might proceed to install applications on, or even admit users to, their clustered systems.
Having seen developers and users alike wreak havoc on the most beautiful
systems, I strongly advise against such folly. However, if you are bribed,
coerced or otherwise motivated to do so, it is indeed possible. Employing a
combination of docker and systemd, it is rather easy, I daresay.
A word of warning: this post is very long. Deploying applications on CoreOS requires sensible packaging, scheduling and service auto-discovery; HA setups and fine-grained machine selection are of practical importance; and I still did not want to be slack on theory. So I did not manage to be shorter.
- Application Containerization
- Deploying applications in CoreOS
- Service Auto-Discovery
- Special Cases
With the ever increasing power of computer hardware, running applications in virtualized environments has become the most common workload. Typically, a physical machine is partitioned into several virtual systems by a piece of software called a hypervisor.
Said hypervisor then delegates each partition to a separate operating system, which in turn provides the kernel and system libraries for application software.
Application software in turn depends on application libraries, system libraries and an operating system kernel. Integrating such software into a given set of application and system library dependencies - of differing versions - soon became a decidedly non-trivial task, known as dependency hell (quod googlet).
It was quickly observed that deploying application software in separate virtualized containers eliminates these problems. Packaging then simplifies to serializing the application, all of its dependencies and the operating system kernel into an archive format, and trivially unpacking and executing this set on a target hypervisor platform.
The archives had to be very large, though, as they needed to contain every dependency from the kernel through libc and all libraries up to the application itself. For cloud operations, which depend on rapidly setting up and destroying machines scaled to the current application load, that approach does not work.
The situation changes when eliminating the hypervisor from the software stack. This is possible when using a technique from the operating system domain named container virtualization. This technology is not really young anymore: it has been around since the very late 1990s in FreeBSD and Solaris.
Container virtualization revolves around the idea of running processes in “private namespaces” provided by the operating system kernel which the processes cannot “leave”. In the absence of a hypervisor and accordingly, hardware emulation, some prefer the term “encapsulation” to the term “virtualization”.
An application process thus encapsulated now maps to a set of application and library binaries. As long as these binaries have been compiled against compatible kernels and libcs, a packaged appliance can lose the kernel and system libraries and only needs to include those dependencies still missing.
Now, it can be observed that many applications share common dependencies, and the problems of the aforementioned dependency hell stem from a relatively small subset.
Then, in the presence of encapsulation, starting from the “most recent common ancestor”, it is possible to “layer” the package. The layer at the bottom would then include libc and those libraries which every package depends on. The next layers include packages that most, but not all, applications depend on, and so forth. This drastically reduces package size and increases sharing of hard disk and even of memory.
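Layering is exactly what a docker build produces: each instruction in a Dockerfile yields a layer which can be shared between images. A minimal sketch, where base image, package and application names are assumptions:

```dockerfile
# Bottom layer shared by many images: libc and core system libraries
FROM debian:jessie

# Middle layer: libraries most, but not all, applications depend on
RUN apt-get update && apt-get install -y libssl1.0.0

# Top layer: only the application itself
COPY myapp /usr/local/bin/myapp
CMD ["/usr/local/bin/myapp"]
```

Two images built from the same base share the bottom layers on disk; only the top layers differ.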
Reducing package size also reduces time required to migrate the package over the network, which is one requirement for cloud deployments.
Deploying applications in CoreOS
As an OS for the cloud, CoreOS requires applications to utilize resources efficiently. It also needs to be able to run diverse and possibly mutually incompatible applications on the same operating system platform. Process encapsulation and layered packaging turn out to be crucial here.
Having to expensively install and configure applications is a deal-breaker in the cloud setting. So, not applications but appliances are deployed on CoreOS installations. In addition, in a cloud setting, operators do not necessarily care whether applications have been installed on particular machines. Instead, operators care whether services are available somewhere.
fleetd and CoreOS
Consequently, CoreOS does not even have a traditional package manager like the dpkg or rpm suites on Debian or RedHat, respectively. Instead, docker and fleetd can be thought of as the package managers for CoreOS cloud ops:
docker (or, more recently, rkt) is used to package and run applications - services. fleetd is the CoreOS service scheduler: it interfaces with the cluster via etcd and determines the placement of services on specific machines.
To do so, fleetd plugs into the new Linux service management facility systemd: appliances are started and stopped as systemd units. systemd manifests - unit files - are stored in the same distributed key-value store CoreOS uses to store cluster information.
systemd units for CoreOS
Syntactically, there is no difference between local systemd unit files and units designated for cluster use:
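For illustration, a minimal unit that runs a stock nginx image; the unit and image names are assumptions:

```ini
# myapp.service - runs an nginx container via docker
[Unit]
Description=nginx web service
After=docker.service
Requires=docker.service

[Service]
# Remove a possibly lingering container of the same name;
# the '-' prefix tolerates failure if none exists.
ExecStartPre=-/usr/bin/docker rm -f nginx
ExecStart=/usr/bin/docker run --rm --name nginx -p 80:80 nginx
ExecStop=/usr/bin/docker stop nginx
```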
Using fleetctl, such a unit file can be published to, and scheduled and executed on, a CoreOS cluster. (fleetctl can be run locally on a cluster member machine running fleetd itself or remotely via ssh tunneling.)
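A sketch, assuming a unit file myapp.service in the working directory and a cluster member reachable at 10.0.0.1:

```shell
# Locally on a cluster member:
fleetctl start myapp.service

# Remotely, via an ssh tunnel to a cluster member:
fleetctl --tunnel 10.0.0.1 start myapp.service
```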
Then, listing the services on the cluster
shows the service being scheduled to a machine in that cluster.
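For example (the machine ID and IP in the output are purely illustrative):

```shell
fleetctl list-units
# UNIT           MACHINE                 ACTIVE  SUB
# myapp.service  2c2a9c27.../10.0.0.12   active  running
```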
fleetctl offers a set of other commands which are sufficiently documented elsewhere. Of the subcommands, it is important to briefly touch on the semantics of the sequence submit, load, start: “submitting” publishes the unit into the cluster, “loading” schedules the unit on some machine, and “starting” actually executes it. The language may be distracting at first.
Service Auto-Discovery
Scheduling and running services cluster-wide is not very useful if the services cannot be used, and to be used, they first need to be found.
The pattern employed in such clusters is to register services at startup with some centrally accessible facility, which in the case of CoreOS is etcd.
When registering, it is necessary to ensure the presence of the target etcd directory:
etcdctl mkdir ... will fail if the directory is already present and create it when not. The failure does
not have any consequences, as the command is prefixed with a ‘-‘ character in the unit file.
The service is then registered (%n will expand to the unit’s name and %H to the host’s name in systemd). Where to find a publicly accessible IP address is highly system-dependent.
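A sketch of such a registration in a unit’s [Service] section; the etcd path /services/www and the port are assumptions. On CoreOS, /etc/environment provides COREOS_PUBLIC_IPV4:

```ini
EnvironmentFile=/etc/environment
# Ensure the directory exists; '-' tolerates failure if it already does.
ExecStartPre=-/usr/bin/etcdctl mkdir /services/www
# Register this unit under its own name with the host's public IP.
ExecStartPost=/usr/bin/etcdctl set /services/www/%n ${COREOS_PUBLIC_IPV4}:80
# Deregister on shutdown.
ExecStopPost=/usr/bin/etcdctl rm /services/www/%n
```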
Accessing Registered Services
Then, it is possible to provision a load balancer or a reverse proxy at some very stable IP endpoint. Note again the pattern: a few valuable machines which need to be taken care of (the IP endpoints) and dispensable machines (running the services).
On the IP endpoints, etcd is watched and some load-balancing or reverse-proxying server is reconfigured on etcd changes. Most commonly, nginx or haproxy is used to turn these machines into dedicated HTTP switches.
While in a trivial setup, watching etcd and rewriting a config file might be accomplished using sed and some lines of ksh, dedicated software does exist which plugs etcd into templates.
Think of a haproxy configuration: It is then possible to template the config with a tool named confd:
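A sketch of a confd template resource; the file paths and the etcd key are assumptions:

```toml
# /etc/confd/conf.d/haproxy.toml
[template]
src        = "haproxy.cfg.tmpl"
dest       = "/etc/haproxy/haproxy.cfg"
keys       = ["/services/www"]
reload_cmd = "systemctl reload haproxy"
```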
Configuring confd to watch a specific etcd path and to reload haproxy on
changes, it is possible to modify
haproxy.cfg using a liquid-like template:
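An excerpt of what such a template might look like, assuming services register themselves under /services/www as sketched earlier:

```
# /etc/confd/templates/haproxy.cfg.tmpl
listen web
    bind *:80
    {{range gets "/services/www/*"}}
    server {{base .Key}} {{.Value}} check
    {{end}}
```

Each registered unit becomes a backend server line; when a service registers or deregisters in etcd, confd rewrites haproxy.cfg and reloads haproxy.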
Having an instance of haproxy running under a stable IP and having that instance reconfigure dynamically using confd solves the problem of accessing dynamically scheduled services in a cluster.
Special Cases
To have more than one process running inside one docker container, it is possible to set the container’s command to systemd’s /usr/sbin/init and let systemd supervise the processes. An example, cobbled together but working, can be found on my github account.
Remember that running systemd inside docker containers requires yielding some isolation properties:
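A sketch of such an invocation; the image name is an assumption, and the exact set of flags depends on the docker and systemd versions in use:

```shell
# systemd as PID 1 needs access to the cgroup hierarchy,
# which is why isolation is relaxed here:
docker run -d \
  --privileged \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  myimage /usr/sbin/init
```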
fleet-specific options are detailed in the upstream documentation here. It is nevertheless necessary to touch on some of them here.
Declaring a unit file a “template” allows multiple instances of a service to run on the cluster’s machines. This is accomplished by appending an ‘@’ character to the name part of the unit file’s filename, as in nginx@.service. Instances are then submitted using a numeric suffix, as in
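For example, with the template unit named as above:

```shell
fleetctl start nginx@1.service nginx@2.service
```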
which schedules and executes two instances of the same unit file.
systemd will take care of unit failures local to the machine by restarting the service. fleet will re-schedule the service to a different machine when a machine leaves the cluster, whether through controlled termination of fleetd or through a crash.
Controlling Scheduling on Machines
Scheduling can be refined by appending an [X-Fleet] section to the unit file, which allows pinning a service precisely to a machine:
MachineMetadata allows running a service on a machine where a matching metadata key-value pair has been set, Conflicts prevents co-location with another service, and Global=true will schedule and execute the service on all cluster machines.
The following [X-Fleet] section, for example, prevents two nginx HA instances from running alongside each other (to guard against cluster machine failure). Machines designated as endpoints are excluded; the idea is to have dedicated IP and/or TCP switching machines running a load balancer instance.
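A sketch of such a section; the unit glob and the metadata key-value pair are assumptions and must match the cluster’s actual metadata:

```ini
[X-Fleet]
# Never co-locate two instances of the HA service:
Conflicts=nginx-ha@*.service
# Only run on machines tagged as workers, keeping endpoints free:
MachineMetadata=role=worker
```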
Irrespective of cluster or non-cluster operation, when using systemd to run docker containers unattended, startup chains might become a bit more complex:
Firstly, to run a docker container, it is necessary to have the corresponding image at hand. Secondly, when pulling images from non-public registries, it is necessary to provide authentication tokens.
In such cases, prepending the startup commands with the parts necessary to login at some private docker registry or pulling some image solves the issue:
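A sketch of such a startup chain in a unit’s [Service] section; the registry URL, credentials and image name are assumptions (in practice the token would not be stored in the unit file in plain text):

```ini
[Service]
ExecStartPre=/usr/bin/docker login -u deploy -p <token> registry.example.com
ExecStartPre=/usr/bin/docker pull registry.example.com/myapp:latest
ExecStart=/usr/bin/docker run --rm --name myapp registry.example.com/myapp:latest
ExecStop=/usr/bin/docker stop myapp
```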
When aiming at minimal downtime in case of machine failures, one might consider holding the images for containers ready on every machine. Logging in and pulling might then be separated out into a second (oneshot) service and scheduled globally.
With the primary service now dependent, a unit could be refactored as such:
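A sketch of such a refactoring, where the unit names and image are assumptions: a globally scheduled oneshot unit pre-pulls the image on every machine, and the primary unit declares a dependency on it.

```ini
# myapp-pull.service - pre-pulls the image on every machine
[Unit]
Description=Pre-pull myapp image

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker pull registry.example.com/myapp:latest

[X-Fleet]
Global=true
```

```ini
# myapp.service - only starts once the image is present
[Unit]
Description=myapp
Requires=myapp-pull.service
After=myapp-pull.service

[Service]
ExecStart=/usr/bin/docker run --rm --name myapp registry.example.com/myapp:latest
ExecStop=/usr/bin/docker stop myapp
```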