Setting Up a CoreOS Cluster
February 8, 2016. 2653 words.
Preliminaries
In this first post of my cluster series, I will explain the design constraints of cloud operations and how to use CoreOS and etcd to set up an arbitrary number of nodes.
These nodes then form a cluster, i.e., they know about their cluster membership and communicate their own state using a dedicated and shared key-value store.
I will start with some theoretical remarks on cloud operations, then proceed to the setup of a CoreOS cluster with Vagrant, and conclude with those “special” cases which are actually the ones relevant to productive operations.
Principles
Cloud Operating Systems
Try not to google that term; Google’s ranking does not necessarily reflect correctness, just as shouting a statement out loud does not make it more true.
Without deep-diving into details, how to deal with state is what sets the cloud apart from traditional systems. State is difficult, viz. costly, to share and to migrate, and easily, but also costly, lost. Insuring against state loss in the presence of machine failure is a classical reason for building distributed systems.
State which must not be lost will then need to be replicated over multiple and mutually independent systems, viz., hosts. State which cannot be replicated over many systems for practical reasons will be held on especially protected systems. State which can be lost without consequences will be lost.
Operating systems for the cloud are designed to enable such a division: important state is replicated over many systems or placed on protected, but “expensive”, stateful systems. This leaves most parts of the “worker nodes” essentially stateless.
CoreOS & etcd
CoreOS is a Linux-based operating system aiming at nearly stateless cloud operations. It can be operated as an installed system with little OS state, or completely statelessly when booted over the network by PXE or similar facilities.
In both cases, configuration is supplied by a YAML-formatted file, which can be downloaded on the fly from an arbitrary http endpoint. Being a cloud OS, CoreOS is meant to be deployed to form machine clusters. I advise caution with the term cluster here: it is widely used in cloud settings to refer to collections of stateless machines, and is not comparable to, e.g., a stateful high-availability zone in a Solaris Veritas cluster.
etcd Cluster State & Discovery
Necessarily, cluster machines need to know about cluster state, i.e., cluster membership and machine addresses. This requirement is met by etcd, a distributed, highly available key-value database which, when deploying CoreOS, is used to store the membership data of participating nodes.
CoreOS cluster discovery connects to the etcd DB over http and publishes cluster membership. How to discover the etcd cluster is passed via the aforementioned cloud-config. Discovery configuration is static and can therefore be passed at boot-time, making cluster discovery stateless as well.
Cluster membership cannot be stateless! etcd is partitioned and replicated, but picky about that special state which defines membership. Thinking of DBMSs like ElasticSearch or DynamoDB, one might believe etcd to survive node loss by replication and voting on dataset correctness.
This is not the case. If the machine identifiers for DBMS-cluster membership change, the etcd cluster will reach an inconsistent state without operator intervention, and operator intervention is only possible while the number of consistent nodes left still allows reaching a quorum consensus. So, a statelessly booted CoreOS cluster (iPXE) cannot hold its own cluster data, but must rely on external, stateful nodes.
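The quorum arithmetic behind this is simple: an n-member etcd cluster needs a majority of floor(n/2)+1 members to make progress, so it tolerates the loss of the rest. A minimal sketch (the helper names are my own):

```shell
# quorum: the smallest majority of an n-member etcd cluster
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerated: how many members may fail before consensus is lost
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5; do
  echo "n=$n quorum=$(quorum $n) tolerated=$(tolerated $n)"
done
# n=1 tolerates 0 failures, n=3 tolerates 1, n=5 tolerates 2
```

Note that even cluster sizes buy no extra fault tolerance over the next-smaller odd size, which is why three- and five-member clusters are the common choices.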
etcd Cluster composition
So, to provision a CoreOS cluster, it is necessary to
- own/have control/etc. over a number of (virtual) machines,
- be able to boot cloud images or boot over the network,
- be able to inject CoreOS configuration files and
- be able to access some stateful etcd cluster state DB.
Practice
Do not blindly copy&paste.
I suggest starting with provisioning VirtualBoxes in Vagrant. This satisfies having virtual machines and booting cloud images (Vagrant boxes in this case). Vagrant itself is made to inject files into boxes. For development purposes, it is possible to generate discovery endpoints at discovery.etcd.io over https.
Generate Discovery Endpoints
curl \
-w "\n" \
'https://discovery.etcd.io/new?size=3'
https://discovery.etcd.io/6a28e078895c5ec737174db2419bb2f3
requests a discovery endpoint with slots for three cluster nodes and returns its URL.
This is only a discovery endpoint. After all slots have been taken, i.e. all hosts have registered at this endpoint, cluster membership data is stored in /var/lib/etcd on the respective hosts. Vagrant boxes are stateful, so this does not hurt (yet). If you later boot statelessly over the network, rebooting will break the cluster!
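The registration state of a discovery endpoint can be inspected with a plain GET, which answers with a JSON listing of the registered nodes. A crude, jq-less sketch for counting the taken slots (the `count_taken` helper and the canned response are my own; in practice the input comes from curl):

```shell
# count taken slots by counting "value" entries in the discovery
# endpoint's JSON answer (each registered member carries one value)
count_taken() { grep -o '"value"' | wc -l; }

# in practice: curl -s "https://discovery.etcd.io/<token>" | count_taken
# canned answer with two registered members:
sample='{"node":{"key":"/_etcd/registry/6a28e0","nodes":[
  {"key":"/_etcd/registry/6a28e0/abc","value":"core-01=http://172.17.8.101:2380"},
  {"key":"/_etcd/registry/6a28e0/def","value":"core-02=http://172.17.8.102:2380"}]}}'
echo "$sample" | count_taken
```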
Setup vagrant
Pull in the vagrant project
git clone https://github.com/coreos/coreos-vagrant.git
cd coreos-vagrant
Remove the .sample suffix from both config.rb.sample and user-data.sample. The resulting files require, for a start, little customization.
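For a three-node cluster, the instance count in config.rb needs raising; a small sed sketch (the `$num_instances` variable name is the one used in coreos-vagrant's sample; the `set_instances` helper is my own):

```shell
# bump the node count in a coreos-vagrant config.rb-style file;
# assumes the sample defines $num_instances on a line of its own
set_instances() {  # $1 = file, $2 = desired count
  sed -i "s/^\$num_instances *= *[0-9]*/\$num_instances = $2/" "$1"
}

# usage: set_instances config.rb 3
```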
Deploy a Basic cloud-config
The file user-data from the Vagrant project is used as the cloud-config file for node configuration. It needs to include configuration for etcd, fleet and systemd. etcd is the distributed key-value store. fleet manages applications. Units define systemd services managed by the service management facility.
#cloud-config
---
coreos:
  etcd2:
    discovery: https://discovery.etcd.io/6a28e078895c5ec737174db2419bb2f3
    advertise-client-urls: http://$private_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://$private_ipv4:2380
  fleet:
    public-ip: "$public_ipv4"
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start
Ports 4001 and 7001 found elsewhere are legacy ports from earlier etcd versions. I suggest leaving them out.
Most important is the discovery endpoint for etcd to find other cluster members - it needs to be uniquely generated for each new cluster. The various URLs define how to connect to other etcd instances. More configurables are documented in the coreos/etcd repository on GitHub.
fleet manages application or service distribution. fleet will be covered later.
The units define which system services need to be started at boot-time. Operators may need special services, which can be included here. etcd is mandatory for clustering machines, the fleet service for deploying clustered applications.
flannel, as found in many tutorials, is an overlay network manager and not required at the operational scale discussed here. For educational purposes, it is best left out.
Further cloud-config customization
It is possible to pass a multitude of configuration options with cloud-config. More details can be found in the upstream documentation.
#cloud-config
---
write_files:
  - path: "/etc/hosts"
    permissions: '0644'
    owner: root
    content: |
      192.168.2.21 host.on.private.net
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC0g+ZTxC7weoIJLUafOgrm+h..."
write_files, for instance, allows specifying path, content and permissions of arbitrary files. Authorized ssh keys are rather important to access your systems; it is usually not a good idea to leave them out.
With configuration in place (user-data), provision a standard CoreOS cluster of three instances with
vagrant up
Using etcdctl
While the cluster-thingy might not be very useful yet, it is instructive to explore etcd via the shell. The usual operations are described in the coreos/etcd documentation on GitHub. For the time being, it is sufficient to explore a bit.
Enter one of the cluster machines with
vagrant ssh core-03
There, list the directory and
core@core-03 ~ $ etcdctl ls --recursive /
/coreos.com
/coreos.com/updateengine
/coreos.com/updateengine/rebootlock
/coreos.com/updateengine/rebootlock/semaphore
retrieve a value in JSON notation:
core@core-03 ~ $ etcdctl get /coreos.com/updateengine/rebootlock/semaphore
{
  "semaphore": 1,
  "max": 1,
  "holders": null
}
List the cluster state (directories beginning with ‘_’ are hidden)
core@core-03 ~ $ etcdctl ls --recursive /_coreos.com
/_coreos.com/fleet
/_coreos.com/fleet/machines
/_coreos.com/fleet/machines/abc772b2338d41ab91cd957bf8e9dcf8
/_coreos.com/fleet/machines/abc772b2338d41ab91cd957bf8e9dcf8/object
/_coreos.com/fleet/machines/c1f0020b881f42529743b5ad52a7d3b9
/_coreos.com/fleet/machines/c1f0020b881f42529743b5ad52a7d3b9/object
/_coreos.com/fleet/machines/e73c8f105d0b49daa1702e3e42565862
/_coreos.com/fleet/machines/e73c8f105d0b49daa1702e3e42565862/object
/_coreos.com/fleet/engine
/_coreos.com/fleet/engine/version
/_coreos.com/fleet/lease
/_coreos.com/fleet/lease/engine-leader
and retrieve the state of a specific machine:
core@core-03 ~ $ etcdctl get /_coreos.com/fleet/machines/e73c8f105d0b49daa1702e3e42565862/object
{
  "ID": "e73c8f105d0b49daa1702e3e42565862",
  "PublicIP": "172.17.8.101",
  "Metadata": {},
  "Version": "0.11.5"
}
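When scripting against such objects without jq at hand, a field can be pulled out with standard tools; a crude sketch, assuming the JSON layout shown above (the `public_ip` helper is my own):

```shell
# extract the "PublicIP" field from a fleet machine object (jq-less)
public_ip() { sed -n 's/.*"PublicIP" *: *"\([^"]*\)".*/\1/p'; }

# in practice the input comes from etcdctl get; a canned object:
obj='{"ID":"e73c8f105d0b49daa1702e3e42565862","PublicIP":"172.17.8.101","Metadata":{},"Version":"0.11.5"}'
echo "$obj" | public_ip
```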
Having a cluster of operating systems is on its own not very useful. While operating systems and distributed systems have their own aesthetics, it takes a special personality to appreciate their inherent beauty. Well, they’ll mature eventually. Most people actually desire to run applications instead. That point will be deferred entirely non-judgementally to the second post.
“Special” Cases
Some “special” cases - which IMHO are those relevant to actual productive cluster operations - will conclude this post.
Booting from iPXE
Stateful machines are expensive to manage, which builds the case for stateless nodes. They are best booted over the network and no installation is committed to persistent storage. On malfunction or upgrade, they are just rebooted.
To boot over the network, CoreOS can be chainloaded with an iPXE-script from an arbitrary http endpoint. Passing the http endpoint for the cloud-config to the CoreOS kernel is sufficient.
#!ipxe
set base-url http://stable.release.core-os.net/amd64-usr/current
set cloud-config http://<endpoint>
kernel ${base-url}/coreos_production_pxe.vmlinuz cloud-config-url=${cloud-config}
initrd ${base-url}/coreos_production_pxe_image.cpio.gz
boot
Then, the file at that endpoint will be applied by coreos-cloudinit.
Dynamically Creating a cloud-config.yaml
Some configuration options may be machine specific. So, one YAML cloud-config cannot cater for all. Some (primitive) dynamic substitution solves the case:
#!/bin/sh
PUBIP=$(ip addr show eth0 | grep -e "inet\s" | awk '{print $2}' | cut -d/ -f1)
HOSTNAME=$(getent hosts ${PUBIP} | awk '{print $2}')
[...]
cat > "cloud-config.yaml" <<EOF
#cloud-config
hostname: ${HOSTNAME}
coreos:
  etcd2:
    name: ${HOSTNAME}
    advertise-client-urls: http://${PUBIP}:2379
    initial-advertise-peer-urls: http://${PUBIP}:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://${PUBIP}:2380
    initial-cluster: "${HOSTNAME}=http://${PUBIP}:2380"
[...]
EOF
The construction of the script is highly specific to the cloud environment you are operating in, as rDNS might not be set up or NICs might be named differently.
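If reverse DNS is not available, a deterministic name can at least be derived from the address itself; a sketch under that assumption (the naming scheme is my own invention):

```shell
#!/bin/sh
# derive a stable node name from the primary IP when rDNS is missing;
# ip addr reports CIDR notation, so the suffix is stripped first
PUBIP="192.168.2.21/24"      # stand-in for: ip addr show eth0 | ...
PUBIP=${PUBIP%%/*}           # -> 192.168.2.21
HOSTNAME="node-$(echo "$PUBIP" | tr '.' '-')"
echo "$HOSTNAME"             # node-192-168-2-21
```

As long as the IP is stable over reboots, so is the derived name, which is all etcd requires.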
Discovery Mechanisms in Stateless Operations
Most tutorials entirely ignore cluster node life-cycle management in stateless operation - I have found none that covers it. Crashes and reboots lose machine identity and cluster data state in common configurations.
A discovery DB reserves slots for every planned cluster member, thus determining cluster size. Nodes will register and block a slot. When rebooting a stateless machine, it will acquire a new identity, but cannot re-join the given etcd cluster, as no free slots remain.
Stateful etcd cluster
In principle, it is possible to clear slots at an etcd discovery endpoint. Discovery endpoint access will then need to be secured, which introduces a bunch of non-trivial aspects into an otherwise rather straightforward setup.
Then, it is easier to bootstrap a cluster statically and supply the addresses of a stateful etcd cluster, which will then manage cluster membership and state sharing for stateless nodes. The stateless nodes themselves are not members of the DB cluster; data is accessed by proxy.
#cloud-config
hostname: <HOSTNAME>
ssh_authorized_keys:
  - $SSH_KEY
coreos:
  etcd2:
    listen-client-urls: http://0.0.0.0:2379
    initial-cluster: "<host1>=http://<IP>:2380,<host2>=http://<IP>:2380,<...>"
    proxy: on
Then, machines can leave and re-join the cluster.
Seeding a (Stateful) Cluster
For building stateful clusters without using discovery, it is advisable to start with a one-machine cluster and then extend that cluster, converging the cloud-config to an eventual state:
#cloud-config
hostname: <HOSTNAME>
ssh_authorized_keys:
  - $SSH_KEY
coreos:
  etcd2:
    name: <HOSTNAME>
    advertise-client-urls: http://<PUBLIC_IP>:2379
    initial-advertise-peer-urls: http://<PUBLIC_IP>:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://<PUBLIC_IP>:2380
    initial-cluster: "<hostname>=http://<PUBLIC_IP>:2380"
    initial-cluster-state: new
After the first member has been successfully set up, the initial cluster listing is expanded step-wise to include all members of the cluster in its targeted state. Using etcdctl and an example from a Vagrant cluster, it is possible
core@core-01 ~ $ etcdctl member list
6ae27f9fa2984b1d: name=core-01 \
peerURLs=http://172.17.8.101:2380 \
clientURLs=http://172.17.8.101:2379
core@core-01 ~ $ etcdctl member add core-02 http://172.17.8.102:2380
Added member named core-02 with ID 94ba962b643dc411 to cluster
ETCD_NAME="core-02"
ETCD_INITIAL_CLUSTER="core-01=http://172.17.8.101:2380, \
core-02=http://172.17.8.102:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
core@core-01 ~ $ etcdctl member list
6ae27f9fa2984b1d: name=core-01 \
peerURLs=http://172.17.8.101:2380 \
clientURLs=http://172.17.8.101:2379
94ba962b643dc411[unstarted]: peerURLs=http://172.17.8.102:2380
to add another member to the cluster - the configuration parameters, albeit in the wrong format for cloud-config, are given as a result.
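The printed ETCD_* variables map mechanically onto cloud-config keys: strip the ETCD_ prefix, lower-case, and swap underscores for dashes. A sketch of that translation (the `to_cloud_config` helper is my own):

```shell
# translate ETCD_*=... lines into their cloud-config key form, e.g.
# ETCD_INITIAL_CLUSTER_STATE="existing" -> initial-cluster-state: "existing"
to_cloud_config() {
  sed -e 's/^ETCD_//' -e 's/=/|/' |
  awk -F'|' '{ key = tolower($1); gsub("_", "-", key); print key ": " $2 }'
}

echo 'ETCD_INITIAL_CLUSTER_STATE="existing"' | to_cloud_config
# -> initial-cluster-state: "existing"
```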
#cloud-config
coreos:
  etcd2:
    name: <HOSTNAME>
    <[...]>
    initial-cluster: "<host>=http://<IP>:2380,<host2>=http://<IP>:2380"
    initial-cluster-state: existing
Iterate until the desired cluster size, and thus the eventual state of the cloud-config, has been reached.
As long as node names and IPs are stable over reboots and the nodes do not lose data, clusters will survive reboots.
Discovery via DNS
etcd can use DNS SRV records to query nodes’ addresses (see the CoreOS Clustering Guide, section on DNS discovery). The initial-cluster: statement can then be replaced by something like
#cloud-config
hostname: <HOSTNAME>
ssh_authorized_keys:
  - $SSH_KEY
coreos:
  etcd2:
    listen-client-urls: http://0.0.0.0:2379
    discovery-srv: <DNSDOMAIN>
    proxy: on
With the following zone
<...>
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra0.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra1.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra2.example.com.
<...>
and corresponding A records, it is possible to supply machines with a responsible etcd cluster, allowing a cluster of stateless nodes to adapt dynamically.
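The SRV answers carry the same information as a static initial-cluster string; as a cross-check, here is a sketch converting `dig +short SRV` style output (fields: priority weight port target) into one (the `srv_to_cluster` helper and the member-naming convention are my own):

```shell
# build an initial-cluster string from `dig +short SRV` style answers;
# member names are taken from the first label of each SRV target
srv_to_cluster() {
  awk '{ host = $4; sub(/\.$/, "", host)      # strip trailing dot
         split(host, l, ".")
         entry = l[1] "=http://" host ":" $3
         out = (out == "" ? entry : out "," entry) }
       END { print out }'
}

# canned answers as dig +short would print them:
srv_to_cluster <<'EOF'
0 0 2380 infra0.example.com.
0 0 2380 infra1.example.com.
EOF
# -> infra0=http://infra0.example.com:2380,infra1=http://infra1.example.com:2380
```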