Setting Up a CoreOS Cluster
In this first post of my cluster series, I will explain the design constraints of cloud operations and how to use CoreOS and etcd to set up an arbitrary number of nodes.
These nodes then form a cluster, i.e., they know about their cluster membership and communicate their own state using a dedicated, shared key-value store.
I will start with some theoretical remarks on cloud operations, then proceed to the setup of a CoreOS cluster with Vagrant, and conclude with those “special” cases which are actually the ones relevant to productive operations.
Cloud Operating Systems
(Do not just google that term; search ranking does not necessarily reflect correctness, as a statement does not become more true by being shouted out loud.)
Without deep-diving into details, how state is handled is what sets cloud apart from traditional systems. State is difficult, viz. costly, to share and to migrate; it is also easily lost, and losing it is costly. Insuring against state loss in the presence of machine failure is a classical reason for building distributed systems.
State which must not be lost will then need to be replicated over multiple and mutually independent systems, viz., hosts. State which cannot be replicated over many systems for practical reasons will be held on especially protected systems. State which can be lost without consequences will be lost.
Operating systems for the cloud are designed to enable such a division: important state will be replicated over many systems or placed on protected but “expensive” stateful systems. This leaves most parts of the “worker nodes” essentially stateless.
CoreOS & etcd
CoreOS is a Linux-based operating system aiming at nearly stateless cloud operations. It can be operated as an installed system with little OS state, or completely stateless when booted over the network by PXE or similar facilities.
In both cases, configuration is supplied by a YAML-formatted file, which can be downloaded on the fly from an arbitrary HTTP endpoint.
Being a cloud OS, CoreOS is meant to be deployed to form machine clusters. (Be careful with the term cluster here. The term is widely used in cloud settings to refer to stateless machines; it is not comparable to, e.g., a stateful high-availability zone in a Solaris cluster.)
etcd Cluster State & Discovery
Necessarily, cluster machines need to know about cluster state, i.e., cluster membership and machine addresses. This requirement is met by etcd, a distributed, highly available key-value database, which in a CoreOS deployment stores the membership data of participating nodes.
CoreOS cluster discovery connects to the etcd DB over HTTP and publishes cluster membership. How to discover the etcd cluster is passed via the aforementioned cloud-config. Discovery configuration is static and can therefore be passed at boot time, making cluster discovery stateless as well.
Cluster membership, however, cannot be stateless! Thinking of DBMSs like Elasticsearch or DynamoDB, one might believe etcd to survive any node loss through replication and voting on dataset correctness.
With etcd, this only holds within limits. etcd replicates its data via the Raft consensus protocol and tolerates the loss of a minority of members, but it cannot recover state that no surviving member holds: if a majority of etcd nodes lose their data, the cluster has lost data. So, a CoreOS cluster whose membership data has been (partially) lost is broken, and a statelessly booted CoreOS cluster (iPXE) cannot hold its own cluster data, but must rely on external, stateful nodes.
etcd Cluster Composition
So, to provision a CoreOS cluster, it is necessary to
- own or otherwise control a number of (virtual) machines,
- be able to boot cloud images or boot over the network,
- be able to inject CoreOS configuration files and
- be able to access some stateful DB for the etcd cluster state.
Do not blindly copy&paste.
I suggest starting with provisioning VirtualBox machines via Vagrant. This satisfies having virtual machines and booting cloud images (vagrant boxes in this case). Vagrant itself is made to inject files into boxes. For development purposes, it is possible to generate discovery endpoints at discovery.etcd.io over HTTPS.
Generate Discovery Endpoints
A single request to discovery.etcd.io generates a discovery endpoint with slots for three cluster nodes and returns its URL.
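A sketch of that request; the size parameter sets the number of slots, and the returned URL (token elided here) is what later goes into the cloud-config:

```shell
# Ask discovery.etcd.io for a fresh discovery endpoint with three slots.
# The service answers with a unique URL containing a token.
curl -w "\n" 'https://discovery.etcd.io/new?size=3'
# e.g. https://discovery.etcd.io/<token>
```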
This is only a discovery endpoint. After all slots have been taken, i.e., all hosts have registered at this endpoint, cluster membership data is stored in /var/lib/etcd on the respective hosts. Vagrant boxes are stateful, so this will not hurt (yet).
If you later boot statelessly over the network, rebooting will break the cluster!
Pull in the vagrant project and create both configuration files from their .sample counterparts, user-data.sample and config.rb.sample. The resulting files require, for the start, little customization.
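A sketch of those steps, assuming the upstream coreos-vagrant repository and its file layout:

```shell
# Clone the CoreOS vagrant project and create the two config files
# from their samples.
git clone https://github.com/coreos/coreos-vagrant.git
cd coreos-vagrant
cp config.rb.sample config.rb    # cluster size, channel, resources
cp user-data.sample user-data    # cloud-config: etcd, fleet, units
```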
Deploy a Basic cloud-config
user-data from the vagrant project is used as the cloud-config file for node configuration.
It needs to include configuration for etcd, fleet and systemd units. etcd is the distributed key-value store. fleet manages applications. Units define the systemd services managed by the service manager.
(Ports 4001 and 7001 found elsewhere are legacy ports for etcd v1. I suggest leaving them out.)
Most important is the discovery endpoint for etcd to find other cluster members - it needs to be uniquely generated for each new cluster. The various URLs define how to connect to other etcd instances. More configurables are documented here.
fleet manages application or service distribution. fleet will be covered later.
The units define which system services need to be started at boot time. Operators may need special services, which can be included here. etcd is mandatory for clustering machines, the fleet service for deploying clustered applications.
flannel, as found in many tutorials, is an overlay network manager and is not required at the operational scale discussed here. For educational purposes, it is best left out.
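A minimal user-data along these lines might look as follows; the etcd2 key names assume a then-current CoreOS release, and the discovery token is a placeholder you must generate yourself:

```yaml
#cloud-config
coreos:
  etcd2:
    # Uniquely generated discovery endpoint; replace <token>.
    discovery: https://discovery.etcd.io/<token>
    # Current etcd ports: 2379 (clients), 2380 (peers).
    # The legacy v1 ports 4001/7001 are left out on purpose.
    advertise-client-urls: http://$private_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://$private_ipv4:2380
  fleet:
    public-ip: $private_ipv4
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start
```

$private_ipv4 is substituted by the platform at provisioning time; flannel is deliberately absent.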
Further cloud-config Customization
It is possible to pass a multitude of configuration options with cloud-config. More details can be found here.
write_files, for instance, allows specifying path, content and permissions of arbitrary files. Authorized SSH keys are rather important for accessing your systems; it is usually not a good idea to leave them out.
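A sketch of both directives; the key and file contents are placeholders, substitute your own:

```yaml
#cloud-config
ssh_authorized_keys:
  # Hypothetical key; use your own public key here.
  - ssh-rsa AAAAB3Nza... core@workstation
write_files:
  - path: /etc/motd
    permissions: "0644"
    owner: root
    content: |
      This node is managed by cloud-config.
```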
With the configuration in place (user-data), provision a standard CoreOS cluster of three instances.
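Assuming the coreos-vagrant project, where the cluster size is set via $num_instances in config.rb, this amounts to:

```shell
# config.rb should contain: $num_instances=3
vagrant up        # boots core-01 .. core-03, applying user-data to each
vagrant status    # all three machines should report 'running'
```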
While the cluster-thingy might not be very useful yet, it is instructive to explore etcd via the shell. The etcdctl documentation describes the most usual operations. For the time being, it is sufficient to explore a bit: enter one of the cluster machines, list a directory and retrieve a value in JSON notation, then list the cluster state (directories beginning with ‘_’ are hidden) and retrieve the state of a specific machine.
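A sketch of such a session; the key paths are illustrative, and the hidden /_etcd directory holds membership data only on older etcd versions, while newer ones expose it via member list:

```shell
# Enter one of the cluster machines.
vagrant ssh core-01

# List the key space, then write and read back a value.
etcdctl ls /
etcdctl set /demo/greeting hello
etcdctl -o json get /demo/greeting   # full node metadata as JSON

# Cluster state: hidden directory on older etcd, member list on newer.
etcdctl ls /_etcd/machines
etcdctl member list
```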
Having a cluster of operating systems is on its own not very useful. While operating systems and distributed systems have their own aesthetics, it takes a special personality to appreciate their inherent beauty. Most people actually desire to run applications instead. That point will be deferred, entirely non-judgmentally, to the second post.
Some “special” cases - which IMHO are those relevant to actual productive cluster operations - will conclude this post.
Booting from iPXE
Stateful machines are expensive to manage, which builds the case for stateless nodes. These are best booted over the network, with no installation committed to persistent storage. On malfunction or upgrade, they are simply rebooted.
To boot over the network, CoreOS can be chainloaded with an iPXE script from an arbitrary HTTP endpoint. Passing the HTTP endpoint of the cloud-config to the CoreOS kernel is sufficient; the file at that endpoint will then be applied by CoreOS’s cloud-init at boot.
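A sketch of such an iPXE script, following the CoreOS PXE conventions of the time; the release URL reflects the then-current stable channel and example.com is a placeholder for your own endpoint:

```
#!ipxe
# Placeholder URLs: adapt channel and cloud-config endpoint.
set base-url http://stable.release.core-os.net/amd64-usr/current
kernel ${base-url}/coreos_production_pxe.vmlinuz cloud-config-url=http://example.com/cloud-config.yaml
initrd ${base-url}/coreos_production_pxe_image.cpio.gz
boot
```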
Dynamically Creating a cloud-config.yaml
Some configuration options may be machine specific, so one static YAML cloud-config cannot cater for all nodes. Some (primitive) dynamic substitution solves the case.
The construction of such a script is highly specific to the cloud environment you are operating in, as reverse DNS might not be set up or NICs might be named differently.
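A minimal sketch of such a substitution script; how the IP and node name are actually discovered (NIC inspection, reverse DNS) is environment specific and stubbed out here as variables:

```shell
#!/bin/sh
# Render a per-machine cloud-config from a template.
# IP and node name discovery are environment specific; here they
# fall back to example values when not supplied by the environment.
IP="${IP:-172.17.8.101}"            # e.g. via `ip -4 addr show eth0`
NODE_NAME="${NODE_NAME:-core-01}"   # e.g. via reverse DNS

cat > user-data <<EOF
#cloud-config
hostname: ${NODE_NAME}
coreos:
  etcd2:
    advertise-client-urls: http://${IP}:2379
    initial-advertise-peer-urls: http://${IP}:2380
EOF
```

The rendered user-data is then served per machine from the cloud-config HTTP endpoint.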
Discovery Mechanisms in Stateless Operations
Most tutorials (in fact, all I have found) entirely ignore cluster node life-cycle management in stateless operation. In common configurations, crashes and reboots lose machine identity and cluster data.
A discovery DB reserves slots for every planned cluster member, thus determining cluster size. Nodes register and occupy a slot. When a stateless machine reboots, it acquires a new identity but cannot re-join the given etcd cluster, as no free slots remain.
Stateful etcd Cluster
In principle, it is possible to clear slots at an etcd discovery endpoint. Access to the discovery endpoint would then need to be secured, which introduces a bunch of non-trivial aspects into an otherwise rather straightforward setup.
It is easier, then, to bootstrap a cluster statically and supply the addresses of a stateful etcd cluster, which manages cluster membership and state sharing for the stateless nodes. The stateless nodes themselves are not members of the DB cluster; data is accessed by proxy.
Machines can then leave and re-join the cluster.
Seeding a (Stateful) Cluster
For building stateful clusters without using discovery, it is advisable to start with a one-machine cluster and then extend that cluster, converging the cloud-config to an eventual state.
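A sketch of the seed node’s etcd2 fragment, assuming the vagrant default addresses (172.17.8.x); initial-cluster starts with a single member and grows as nodes are added:

```yaml
#cloud-config
coreos:
  etcd2:
    name: core-01
    initial-cluster-state: new
    # Start as a one-member cluster; extend this list as members join.
    initial-cluster: core-01=http://172.17.8.101:2380
    initial-advertise-peer-urls: http://172.17.8.101:2380
    advertise-client-urls: http://172.17.8.101:2379
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://172.17.8.101:2380
```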
After the first member has been successfully set up, the initial-cluster string is expanded step by step to include all members of the cluster in its target state. Using etcdctl member add against a vagrant cluster, it is possible to add another member; the required configuration parameters, albeit in the wrong format for cloud-config, are given as the result.
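A sketch with vagrant addresses; the exact output wording varies by etcd version, but it is roughly:

```shell
# Announce the new member to the running cluster (run on any member).
etcdctl member add core-02 http://172.17.8.102:2380
# etcd answers with environment-style parameters, e.g.
#   ETCD_NAME="core-02"
#   ETCD_INITIAL_CLUSTER="core-01=http://172.17.8.101:2380,core-02=http://172.17.8.102:2380"
#   ETCD_INITIAL_CLUSTER_STATE="existing"
# These must be translated into the cloud-config keys
# (name, initial-cluster, initial-cluster-state) of the new node.
```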
Iterate until the desired cluster size and thus the eventual state of the cloud-config has been reached.
As long as node names and IPs are stable over reboots and the nodes do not lose data, clusters will survive reboots.
Discovery via DNS
etcd can use DNS SRV records to query the nodes’ addresses. The initial-cluster: statement can then be replaced by a discovery-srv: entry pointing at a domain.
With a suitable zone and corresponding A-records, it is possible to supply machines with a responsible etcd cluster, allowing a cluster of stateless nodes to adapt dynamically.
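A sketch of both pieces; etcd looks up _etcd-server._tcp and _etcd-client._tcp SRV records under the given domain, and example.com with its IPs stands in for your own zone:

```yaml
#cloud-config
coreos:
  etcd2:
    # Replaces initial-cluster:; etcd queries SRV records instead.
    discovery-srv: example.com
```

The zone then announces the stateful etcd members:

```
; SRV records for the etcd peer (2380) and client (2379) ports
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 core-01.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 core-02.example.com.
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 core-01.example.com.
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 core-02.example.com.
; corresponding A-records
core-01.example.com. 300 IN A 172.17.8.101
core-02.example.com. 300 IN A 172.17.8.102
```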