Preliminaries

In this first post of my cluster series, I will explain the design constraints of cloud operations and how to use CoreOS and etcd to set up an arbitrary number of nodes.

These nodes then form a cluster, i.e., they know about their cluster membership and communicate their own state using a dedicated and shared key-value store.

I will start with some theoretical remarks on cloud operations, then proceed to the setup of a CoreOS cluster with Vagrant, and conclude with those “special” cases which are actually the ones relevant to productive operations.

Principles

Cloud Operating Systems

(Do not google that term; search ranking does not necessarily reflect correctness, as being shouted out loud does not make a statement more true.)

Without deep-diving into details, how to deal with state is what sets the cloud apart from traditional systems. State is difficult: costly to share and to migrate, and easily, yet also costly, lost. Insuring against state loss in the presence of machine failure is a classical reason for building distributed systems.

State which must not be lost will then need to be replicated over multiple and mutually independent systems, viz., hosts. State which cannot be replicated over many systems for practical reasons will be held on especially protected systems. State which can be lost without consequences will be lost.

Operating systems for the cloud are designed to enable such a division: Important state will be replicated over many or placed on protected, but “expensive” stateful systems. This leaves most parts of the “worker nodes” essentially stateless.

CoreOS & etcd

CoreOS is a Linux-based operating system aiming at nearly stateless cloud operations. It can be operated as an installed system with little OS state or completely stateless when booted over the network by PXE or similar facilities.

In both cases, configuration is supplied by a YAML-formatted file, which may well be downloaded on the fly from an arbitrary HTTP endpoint.
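Whatever the transport, coreos-cloudinit only applies a file whose first line is exactly "#cloud-config" (or a script starting with "#!"). A minimal local sanity check, with a sample file created inline so the snippet is self-contained:

```shell
# Sketch: check that a cloud-config starts with the mandatory header line.
# The sample file is created inline; point the check at your real user-data.
printf '#cloud-config\nhostname: demo\n' > user-data.demo
head -n 1 user-data.demo | grep -qx '#cloud-config' && echo "header ok"
```

A missing or misspelled header is a classic reason for a node silently ignoring its configuration.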

Being a cloud OS, CoreOS is meant to be deployed to form machine clusters. (Be careful with the term cluster here. The term is widely used in cloud settings referring to stateless machines; it is not comparable to, e.g., a stateful high-availability zone in a Solaris cluster.)

etcd Cluster State & Discovery

Necessarily, cluster machines need to know about cluster state, i.e., cluster membership and machine addresses. This requirement is met by etcd, a distributed, highly available key-value database which, when deploying CoreOS, is used to store membership data of participating nodes.

CoreOS cluster discovery connects to the etcd DB over HTTP and publishes cluster membership. How to discover the etcd cluster is passed via the aforementioned cloud-config. Discovery configuration is static and can therefore be passed at boot time, making cluster discovery stateless as well.

Cluster membership cannot be stateless! etcd replicates its data across cluster members via the Raft consensus protocol and tolerates the loss of a minority of members. Thinking of DBMSs like Elasticsearch or DynamoDB, one might believe etcd to survive arbitrary node loss by replication and voting on dataset correctness.

With etcd, this is not the case. Once a majority of members' data is lost, the etcd cluster has lost data for good, and statelessly booted nodes lose their data on every reboot. So, a CoreOS cluster whose membership data has been (partially) lost is broken, and a statelessly booted CoreOS cluster (iPXE) cannot hold its own cluster data, but must rely on external, stateful nodes.

etcd Cluster composition

So, to provision a CoreOS cluster, it is necessary to

  1. own/have control/etc. over a number of (virtual) machines,
  2. be able to boot cloud images or boot over the network,
  3. be able to inject CoreOS configuration files and
  4. be able to access some stateful etcd cluster state DB.

Practice

Do not blindly copy&paste.

I suggest starting by provisioning VirtualBox VMs with Vagrant. This satisfies having virtual machines and booting cloud images (Vagrant boxes in this case). Vagrant itself is made to inject files into boxes. For development purposes, it is possible to generate discovery endpoints at discovery.etcd.io over HTTPS.

Generate Discovery Endpoints

curl -w "\n" 'https://discovery.etcd.io/new?size=3'
https://discovery.etcd.io/6a28e078895c5ec737174db2419bb2f3

generates a discovery endpoint with slots for three cluster nodes and returns its URL.

This is only a discovery endpoint. After all slots have been taken, i.e., all hosts have registered at this endpoint, cluster membership data is stored under /var/lib/etcd on the respective hosts. Vagrant boxes are stateful, so this will not hurt (yet). If you later boot statelessly over the network, rebooting will break the cluster!

Set up Vagrant

Pull in the vagrant project

git clone https://github.com/coreos/coreos-vagrant.git
cd coreos-vagrant

Remove the .sample suffix from both config.rb.sample and user-data.sample. The resulting files require, for a start, little customization.

Deploy a Basic cloud-config

The file user-data from the Vagrant project is used as the cloud-config file for node configuration. It needs to include configuration for etcd, fleet and systemd units: etcd is the distributed key-value store, fleet manages applications, and units define systemd services managed by the service management facility.

(Ports 4001 and 7001 found elsewhere are legacy ports for etcd v1. I suggest leaving them out.)

#cloud-config
---
coreos:
  etcd2:
    discovery: https://discovery.etcd.io/6a28e078895c5ec737174db2419bb2f3
    advertise-client-urls: http://$private_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://$private_ipv4:2380
  fleet:
    public-ip: "$public_ipv4"
  units:
  - name: etcd2.service
    command: start
  - name: fleet.service
    command: start

Most important is the discovery endpoint for etcd to find other cluster members - it needs to be uniquely generated for each new cluster. The various URLs define how to connect to other etcd instances. More configurables are documented here.
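Since the token must be freshly generated per cluster, stamping it into user-data can be scripted. A minimal sketch - the replacement URL below is a dummy, and a sample file is created inline so the snippet stands alone; in practice the new URL would come from curl 'https://discovery.etcd.io/new?size=3':

```shell
# Sketch: replace the discovery URL in a cloud-config with a new one.
# NEW_URL is a dummy placeholder, not a live endpoint.
NEW_URL='https://discovery.etcd.io/0000000000000000000000000000dead'
printf '    discovery: https://discovery.etcd.io/6a28e078895c5ec737174db2419bb2f3\n' > user-data.tmp
sed -i "s#discovery: .*#discovery: ${NEW_URL}#" user-data.tmp
cat user-data.tmp
```

Reusing a stale token is a common source of clusters that never converge, since the old slots are already taken.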

fleet manages application or service distribution. fleet will be covered later.

The units define which system services need to be started at boot time. Operators may need special services, which can be included here. etcd is mandatory for clustering machines, the fleet service for deploying clustered applications.

flannel, as found in many tutorials, is an overlay network manager and not required at the operational scale discussed here; for educational purposes, it is best left out.

Further cloud-config customization

It is possible to pass a multitude of configuration options with cloud-config. More details can be found here.

write_files:
- path: "/etc/hosts"
  permissions: '0644'
  owner: root
  content: |
    192.168.2.21 host.on.private.net
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC0g+ZTxC7weoIJLUafOgrm+h..."

write_files, for instance, allows specifying path, content and permissions of arbitrary files. Authorized SSH keys are rather important for accessing your systems; it is usually not a good idea to leave them out.

With configuration in place (user-data), provision a standard CoreOS cluster of three instances with

vagrant up

Using etcdctl

While the cluster-thingy might not be very useful yet, it is instructive to explore etcd via the shell. etcdctl covers the most usual operations. For the time being, it is sufficient to explore a bit. Enter one of the cluster machines with

vagrant ssh core-03

There, list the directory and

core@core-03 ~ $ etcdctl ls  --recursive /
/coreos.com
/coreos.com/updateengine
/coreos.com/updateengine/rebootlock
/coreos.com/updateengine/rebootlock/semaphore

retrieve a value in JSON notation:

core@core-03 ~ $ etcdctl get /coreos.com/updateengine/rebootlock/semaphore
{
  "semaphore": 1,
  "max": 1,
  "holders": null
}

List the cluster state (directories beginning with ‘_’ are hidden)

core@core-03 ~ $ etcdctl ls  --recursive /_coreos.com
/_coreos.com/fleet
/_coreos.com/fleet/machines
/_coreos.com/fleet/machines/abc772b2338d41ab91cd957bf8e9dcf8
/_coreos.com/fleet/machines/abc772b2338d41ab91cd957bf8e9dcf8/object
/_coreos.com/fleet/machines/c1f0020b881f42529743b5ad52a7d3b9
/_coreos.com/fleet/machines/c1f0020b881f42529743b5ad52a7d3b9/object
/_coreos.com/fleet/machines/e73c8f105d0b49daa1702e3e42565862
/_coreos.com/fleet/machines/e73c8f105d0b49daa1702e3e42565862/object
/_coreos.com/fleet/engine
/_coreos.com/fleet/engine/version
/_coreos.com/fleet/lease
/_coreos.com/fleet/lease/engine-leader

and retrieve the state of a specific machine:

core@core-03 ~ $ etcdctl get /_coreos.com/fleet/machines/e73c8f105d0b49daa1702e3e42565862/object
{
  "ID" : "e73c8f105d0b49daa1702e3e42565862",
  "PublicIP": "172.17.8.101",
  "Metadata": {},
  "Version": "0.11.5"
}

Having a cluster of operating systems is on its own not very useful. While operating systems and distributed systems have their own aesthetics, it takes a special personality to appreciate their inherent beauty. Most people actually desire to run applications instead. That point will be deferred, entirely non-judgementally, to the second post.

“Special” Cases

Some “special” cases - which IMHO are those relevant to actual productive cluster operations - will conclude this post.

Booting from iPXE

Stateful machines are expensive to manage, which makes the case for stateless nodes: they are best booted over the network, with no installation committed to persistent storage. On malfunction or upgrade, they are simply rebooted.

To boot over the network, CoreOS can be chainloaded with an iPXE-script from an arbitrary http endpoint. Passing the http endpoint for the cloud-config to the CoreOS kernel is sufficient.

#!ipxe

set base-url http://stable.release.core-os.net/amd64-usr/current
set cloud-config http://<endpoint>

kernel ${base-url}/coreos_production_pxe.vmlinuz cloud-config-url=${cloud-config}
initrd ${base-url}/coreos_production_pxe_image.cpio.gz
boot

Then, the file at that endpoint will be applied by coreos-cloudinit.

Dynamically Creating a cloud-config.yaml

Some configuration options may be machine-specific, so one YAML cloud-config cannot cater for all. Some (primitive) dynamic substitution solves the case:

#!/bin/sh
PUBIP=`ip addr show eth0 | grep -e "inet\s" | awk '{print $2}' | cut -d/ -f1`
HOSTNAME=`getent hosts ${PUBIP} | awk '{print $2}'`
[...]
cat > "cloud-config.yaml" <<EOF
#cloud-config
hostname: ${HOSTNAME}
coreos:
  etcd2:
    name: ${HOSTNAME}
    advertise-client-urls: http://${PUBIP}:2379
    initial-advertise-peer-urls: http://${PUBIP}:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://${PUBIP}:2380
    initial-cluster: "${HOSTNAME}=http://${PUBIP}:2380"
[...]
EOF

The construction of the script is highly specific to the cloud environment you are operating in, as rDNS might not be set up or NICs might be named differently.
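One pitfall in such scripts deserves a note: "ip addr" reports addresses in CIDR notation, so the prefix length must be stripped before the address is usable in a URL. A self-contained illustration, with a sample output line pasted as a literal:

```shell
# Sketch: strip the /prefix from an "ip addr" inet line to get a bare IP.
# The input line is a pasted sample, standing in for live command output.
INET_LINE='    inet 192.168.2.21/24 brd 192.168.2.255 scope global eth0'
PUBIP=$(echo "$INET_LINE" | awk '{print $2}' | cut -d/ -f1)
echo "$PUBIP"
```

Without the cut, the generated advertise URLs would contain "192.168.2.21/24" and etcd would refuse to start.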

Discovery Mechanisms in Stateless Operations

Most tutorials - in fact, I have found none that does not - entirely ignore cluster node life-cycle management in stateless operation. Crashes and reboots lose machine identity and cluster data state in common configurations.

A discovery DB reserves slots for every planned cluster member, thus determining cluster size. Nodes register and block a slot. When a stateless machine reboots, it acquires a new identity but cannot re-join the given etcd cluster, as no free slots remain.

Stateful etcd cluster

In principle, it is possible to clear slots at an etcd discovery endpoint. Discovery endpoint access will then need to be secured, which introduces a bunch of non-trivial aspects into an otherwise rather straightforward setup.

It is easier, then, to bootstrap a cluster statically and supply the addresses of a stateful etcd cluster, which will manage cluster membership and state sharing for the stateless nodes. The stateless nodes themselves are not members of the DB cluster; data is accessed by proxy.

#cloud-config
hostname: <HOSTNAME>
ssh_authorized_keys:
  - $SSH_KEY
coreos:
  etcd2:
    listen-client-urls: http://0.0.0.0:2379
    initial-cluster: "<host1>=http://<IP>:2380,<host2>=http://<IP>:2380,<...>"
    proxy: on

Then, machines can leave and re-join the cluster.
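Assembling the initial-cluster string by hand is error-prone; a small sketch that builds it from hostname/IP pairs (names and addresses are examples, matching the Vagrant defaults used elsewhere in this post):

```shell
# Sketch: assemble the comma-separated initial-cluster string for a
# static three-node cluster from hostname=IP pairs.
PEERS=""
for pair in core-01=172.17.8.101 core-02=172.17.8.102 core-03=172.17.8.103; do
  host=${pair%%=*}; ip=${pair#*=}
  PEERS="${PEERS:+${PEERS},}${host}=http://${ip}:2380"
done
echo "$PEERS"
# -> core-01=http://172.17.8.101:2380,core-02=http://172.17.8.102:2380,core-03=http://172.17.8.103:2380
```

The resulting string can be substituted into the initial-cluster: field of the cloud-config above.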

Seeding a (Stateful) Cluster

For building stateful clusters without using discovery, it is advisable to start with a one-machine cluster and then extend that cluster, converging the cloud-config to an eventual state:

#cloud-config
hostname: <HOSTNAME>
ssh_authorized_keys:
  - $SSH_KEY
coreos:
  etcd2:
    name: <HOSTNAME>
    advertise-client-urls: http://<PUBLIC_IP>:2379
    initial-advertise-peer-urls: http://<PUBLIC_IP>:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://<PUBLIC_IP>:2380
    initial-cluster: "<hostname>=http://<PUBLIC_IP>:2380"
    initial-cluster-state: new

After the first member has been successfully set up, the initial-cluster string will be expanded step-wise to include all members of the cluster in its targeted state. Using etcdctl and an example from a Vagrant cluster, it is possible

core@core-01 ~ $ etcdctl member list
6ae27f9fa2984b1d: name=core-01 \
                  peerURLs=http://172.17.8.101:2380 \
                  clientURLs=http://172.17.8.101:2379

core@core-01 ~ $ etcdctl member add  core-02 http://172.17.8.102:2380
Added member named core-02 with ID 94ba962b643dc411 to cluster

ETCD_NAME="core-02"
ETCD_INITIAL_CLUSTER="core-01=http://172.17.8.101:2380, \
                      core-02=http://172.17.8.102:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

core@core-01 ~ $ etcdctl member list
6ae27f9fa2984b1d:            name=core-01 \
                             peerURLs=http://172.17.8.101:2380 \
                             clientURLs=http://172.17.8.101:2379
                  
94ba962b643dc411[unstarted]: peerURLs=http://172.17.8.102:2380

to add another member to the cluster - the required configuration parameters, albeit in the wrong format for cloud-config, are given as the result.

#cloud-config
coreos:
  etcd2:
    name: <HOSTNAME>
    <[...]>
    initial-cluster: "<host>=http://<IP>:2380,<host2>=http://<IP>:2380"
    initial-cluster-state: existing

Iterate until the desired cluster size and thus the eventual state of the cloud-config has been reached.
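The iteration can be scripted. The following sketch only prints the etcdctl commands to run (hosts and IPs are examples), since after each member add, the new node must also be provisioned with the expanded initial-cluster string before proceeding:

```shell
# Sketch: print the member-add commands for the remaining nodes.
# Each printed command would be run against the live cluster, followed
# by booting the new member with initial-cluster-state: existing.
for peer in core-02=http://172.17.8.102:2380 core-03=http://172.17.8.103:2380; do
  name=${peer%%=*}; url=${peer#*=}
  echo "etcdctl member add ${name} ${url}"
done
```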

As long as node names and IPs are stable over reboots and the nodes do not lose data, clusters will survive reboots.

Discovery via DNS

etcd can use DNS SRV records to query the nodes’ addresses. The initial-cluster: statement can then be replaced by something like

#cloud-config
hostname: <HOSTNAME>
ssh_authorized_keys:
  - $SSH_KEY
coreos:
  etcd2:
    listen-client-urls: http://0.0.0.0:2379
    discovery-srv: <DNSDOMAIN>
    proxy: on    

With the following zone

<...>
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra0.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra1.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra2.example.com.
<...>

and corresponding A-records, it is possible to supply machines with a responsible etcd cluster, allowing a cluster of stateless nodes to adapt dynamically.
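For illustration, the peer endpoints etcd derives from such SRV records can be mimicked locally - the record data is pasted as a literal here, whereas in practice etcd queries DNS itself:

```shell
# Sketch: turn SRV record data (priority weight port target) into
# host:port peer endpoints, as etcd's DNS discovery effectively does.
SRV='0 0 2380 infra0.example.com.
0 0 2380 infra1.example.com.
0 0 2380 infra2.example.com.'
echo "$SRV" | awk '{gsub(/\.$/, "", $4); printf "%s:%s\n", $4, $3}'
# -> infra0.example.com:2380 (and so on for infra1, infra2)
```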