- Overview
- Programs
- IP Address Management (IPAM)
- Routing
- On-demand NAT for egress traffic
- Garbage Collection
- Upgrading from v1
- Diagrams
- Custom Resource Definitions (CRDs)
Coil is a CNI plugin for Kubernetes. It is designed with modularity and performance in mind. This document describes the background and goals of version 2 of Coil.
Coil v1 was our first attempt to create something for Kubernetes. When we started to create Coil, we did not know how to create operators, how to elect a leader controller, nor how to expose metrics for Prometheus.
Now that we have learned how to do these things and want to add rich features such as on-demand NAT for egress traffic, it is time to revamp the implementation.
- Use CRD to configure Coil instead of the CLI tool `coilctl`.
- Store status data in `kube-apiserver` instead of etcd.
- Decouple the pool name and namespace name. Use annotations to specify the pool to be used.
- Use gRPC instead of REST for local inter-process communication.
- Use CRD for communication between the controller and the node pods.
- Use leader-election of the controller for better availability.
- Export Prometheus metrics. Specifically, usage stats of the address pools.
- Add on-demand NAT for egress traffic.
- Keep the rest of the v1 architecture and concepts.
- IP address management (IPAM)
- Multiple pools of IP addresses
- Intra-node routing
- Loose coupling with external routing program
- On-demand NAT for egress traffic using Foo over UDP
- Exporting metrics for Prometheus
Coil v2 will consist of the following programs:
- `coil-controller`: Kubernetes controller managing custom resources.
- `coild`: Daemon program running on nodes.
- `coil`: CNI interface that delegates requests from `kubelet` to `coild`.
- `coil-egress`: Administration program running in Egress pods.
To assign IP addresses to pods, coil has pools of IP addresses. Each pod can request an IP address assignment from a single pool. For example, a pod may request a globally routable IP address from a pool of global IP addresses.
Choosing the pool should be controlled carefully by the cluster admins. If a user can freely choose the pool of global IP addresses, the user could easily consume the limited addresses or expose the pod to the Internet without protection.
Therefore, there is at most one default pool, which should contain non-global IP addresses. To use non-default pools, the admins should add an annotation to the namespace of the pod. The annotation value should be the name of the pool that is used to assign IP addresses to the pods in the namespace.
To keep things simple, the default pool is the pool whose name is `default`.
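For illustration, associating a namespace with a non-default pool could look like the following sketch; the annotation key `coil.cybozu.com/pool`, the namespace, and the pool name are assumptions used only for this example:

# Hypothetical example: pods in namespace "team-foo" get addresses from pool "global".
# The annotation key coil.cybozu.com/pool is an assumed example value here.
kubectl annotate namespace team-foo coil.cybozu.com/pool=global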
To reduce the number of advertised routes, addresses in an address pool are divided into fixed-size blocks. These blocks are called address blocks and are assigned to nodes. Since all IP addresses in an address block are routed to the same node, only one route per address block needs to be advertised.
For example, if an address pool defines the size of an address block as 2^5 (32 addresses), `coil-controller` will carve an address block with a `/27` subnet mask for IPv4 out of the pool and assign it to a node. For IPv6, the same block size corresponds to a `/123` subnet mask (128 - 5 = 123).
In general, it is better to avoid reusing an IP address immediately, so as not to confuse other software or components.
To avoid such immediate reuse, `coil-controller` remembers the last used address and assigns addresses starting from the next one.
The same problem may occur when we use address blocks of size `/32`.
In this case, there is a high chance of reusing the same address immediately.
However, address blocks of size `/32` are usually used for public addresses.
Public IP addresses are not allocated and released frequently.
Thus, we do not worry about this situation.
`coild` picks up a free IP address from an address block and assigns it to a pod.
To make things fast, `coild` builds an in-memory cache of allocated IP addresses by scanning the host OS network namespace at startup.
The basic flow looks like:
1. Receive an assignment request from `coil` via gRPC over a UNIX domain socket.
2. Determine which pool should be chosen for the pod.
3. If the node has an address block of that pool with free IP addresses, skip to step 5.
4. Request and wait for the assignment of a new address block of the pool.
5. Pick a free IP address out of the block.
6. Return the picked address to `coil`.
Coil programs only the intra-node routing between the node OS and the pods on that node. For inter-node routing, Coil exports routing information to an external routing daemon such as BIRD.
This design allows users to choose the routing daemon and protocol that suit their needs.
For each pod running on a node, `coil` creates a veth pair. One end of the pair becomes the `eth0` interface of the pod containers, and the other end is used in the host OS.
In the host OS network namespace, `coil` inserts an entry into the kernel routing table to route packets to the pod. It also assigns a link-local address to the host-side end of the veth. For IPv4, the address is `169.254.1.1`.
In the pod network namespace, `coil` inserts the default gateway like this:
ip route add 169.254.1.1/32 dev eth0 scope link
ip route add default via 169.254.1.1 scope global
Coil does not use the bridge virtual interface.
For each allocated address block, `coild` inserts a route into an unused kernel routing table.
This table can be referenced by an external routing program such as BIRD. Users can configure the external program to advertise the routes with the protocol of their choice.
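As a sketch, an administrator (or an external routing program) could look at the per-block routes with a command like the following; the table number 119 is only a placeholder for this example:

# List the per-address-block routes in the dedicated routing table.
# The table number 119 is only a placeholder for this sketch.
ip route show table 119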
In environments where nodes and pods only have private IP addresses, communication to the external networks requires source network address translation (SNAT). If the underlying network provides SNAT, there is no problem. But if not, we need to somehow implement SNAT on Kubernetes.
Coil provides a feature to implement SNAT routers on Kubernetes for this kind of environment. This feature is on-demand because admins can allow only a subset of pods to use the SNAT routers.
SNAT routers can be created in the following steps:
- Prepare an address pool whose IP addresses can communicate with one (or more) external networks.
- Prepare a namespace associated with the address pool.
- Run pods in the namespace. These pods work as SNAT routers.
- Configure iptables rules in the router pods for SNAT.
The iptables rule looks like:
iptables -t nat -A POSTROUTING ! -s <pod address>/32 -o eth0 -j MASQUERADE
Since the underlay network cannot route packets whose destination addresses are in the external network(s), the remaining problem is how to route packets originating from client pods to router pods. The solution is to use tunnels.
Linux has a number of tunneling options. Among others, we choose Foo over UDP (FoU) because it has good properties:
- FoU encapsulates packets in UDP.
- UDP packets can be processed efficiently in modern network peripherals.
- UDP servers can be made redundant with Kubernetes' Service.
- FoU can tunnel both IPv4 and IPv6 packets.
FoU encapsulates packets of other IP-based tunneling protocols in UDP packets. The simplest pattern is to encapsulate IP in IP (IPIP) packets.
This can be configured by 1) creating an IPIP tunnel device with the FoU encapsulation option, and 2) adding a FoU listening port as follows:
$ sudo ip link add name tun1 type ipip ttl 225 \
remote 1.2.3.4 local 5.6.7.8 \
encap fou encap-sport auto encap-dport 5555
$ sudo ip fou add port 5555 ipproto 4 # 4 means IPIP protocol
To send an FoU encapsulated packet to an external network `11.22.33.0/24` via the SNAT router `1.2.3.4`, the packet needs to be routed to the `tun1` link. This can be done by, for example, `ip route add 11.22.33.0/24 dev tun1`.
To receive an FoU packet, the packet needs to be delivered to port 5555. The kernel then strips the FoU header from the packet and tries to find a matching IPIP link because of `ipproto 4`. If no matching IPIP link is found, the packet will be dropped. If one is found, the encapsulated body of the packet will be processed.
For tunneling IPv6 packets over IPv6, the protocol number needs to be changed from `4` to `41` and the link type from `ipip` to `ip6tnl` with mode `ip6ip6`.
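A minimal sketch of the IPv6 counterpart of the commands above, with placeholder addresses (exact options may vary by kernel and iproute2 version):

$ sudo ip link add name tun2 type ip6tnl mode ip6ip6 \
      remote fd01::1234 local fd01::5678 \
      encap fou encap-sport auto encap-dport 5555
$ sudo ip -6 fou add port 5555 ipproto 41 # 41 means IPv6 encapsulation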
The transmission between client pods and the SNAT router needs to be bidirectional, because otherwise packets returning from the external network via the SNAT router to the client may not reach the final destination, for the following reasons.
- The returning packet's source address can be a global IP address, and such packets are often rejected by NetworkPolicy.
- If the packet is TCP's SYN-ACK, it is often dropped by the underlay network because there appears to be no corresponding SYN packet. Note that the SYN packet was sent through the FoU tunnel.
If the SNAT routers are behind a Kubernetes Service, the IPIP tunnel on the client pod is configured to send packets to the Service's ClusterIP. Therefore, the FoU encapsulated packet will have the ClusterIP as its destination address.
Remember that we need bidirectional tunneling. If the returning packet has the SNAT router's IP address as its source address, it does not match the IPIP tunnel configured for the Service's ClusterIP. Therefore, in addition to the IPIP tunnel device with the FoU encapsulation option, we set up a flow-based IPIP tunnel device to receive such returning packets; otherwise, clients would return ICMP destination unreachable packets. This flow-based IPIP tunnel device works as a catch-all fallback interface for the IPIP decapsulation stack.
For example, a NAT client (`10.64.0.65:49944`) sends an encapsulated packet to the ClusterIP `10.68.114.217:5555`, and a return packet comes from a router Pod (`10.72.49.1:59203`) to the client.
The outgoing packet will be encapsulated by the IPIP tunnel device with the FoU encapsulation option, and the incoming packet will be received and decapsulated by the flow-based IPIP tunnel device.
10.64.0.65.49944 > 10.68.114.217.5555: UDP, length 60
10.72.49.1.59203 > 10.64.0.65.5555: UDP, length 60
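As a sketch, the flow-based IPIP device described above can be created with iproute2 as follows; the device name is only an example:

$ sudo ip link add name ipip-flow type ipip external # flow-based (catch-all) IPIP device
$ sudo ip link set ipip-flow up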
Before Coil v2.4.0, we configured a fixed source port 5555 for the FoU encapsulation devices so that `kube-proxy` or Cilium's kube-proxy replacement could do the reverse SNAT handling.
The transmit and receive sides have been separated, and the communication can be asymmetric as the example above shows. We were relying on the fixed source port to handle the reverse SNAT.
This fixed source port approach causes the following problems:
- Traffic from NAT clients to router Pods cannot be distributed when Coil is used with a proxier that selects a backend based on the flow hash, such as Cilium.
- When a router Pod is terminating, traffic from NAT clients to the router Pod cannot be switched away until the Pod is finally removed. This problem happens with the graceful termination of Cilium's kube-proxy replacement.
We encourage users to set `fouSourcePortAuto: true` to avoid these problems.
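For example, an Egress that lets the kernel pick FoU source ports could look like this; a minimal sketch based on the `Egress` example later in this document:

apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  name: internet
  namespace: internet-egress
spec:
  destinations:
    - 0.0.0.0/0
  fouSourcePortAuto: true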
To tunnel TCP packets, we need to keep sending the packets to the same SNAT router.
This can be achieved by setting the Service's `spec.sessionAffinity` to `ClientIP`.
Therefore, Coil creates a Service with `spec.sessionAffinity=ClientIP` for each NAT gateway.
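As a sketch, you can check the session affinity of the generated Service with a command like this; the namespace and Service name are placeholders:

kubectl -n internet-egress get service internet -o jsonpath='{.spec.sessionAffinity}'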
Note that session persistence is not required if you use this feature in conjunction with Cilium's kube-proxy replacement.
Cilium selects a backend for the Service based on the flow hash, and the kernel picks source ports based on the flow hash of the encapsulated packet.
This means that traffic belonging to the same TCP connection from a NAT client to a router Service is always sent to the same Pod.
To enable auto-scaling with the Horizontal Pod Autoscaler (HPA), `Egress` implements the `scale` subresource.
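As a sketch, an HPA could target an `Egress` through the `scale` subresource like this; the resource names are taken from the `Egress` example later in this document, while the metric and thresholds are arbitrary example values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: internet
  namespace: internet-egress
spec:
  scaleTargetRef:
    apiVersion: coil.cybozu.com/v2
    kind: Egress
    name: internet
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80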
On-demand NAT for egress traffic is implemented with a CRD called `Egress`.
`Egress` is a namespaced resource. It will create a `Deployment` and a `Service` to run router pods in the same namespace.
In each router pod, `coil-egress` runs to maintain FoU tunnels connected to its client pods.
For client pods, a special annotation tells Coil to set up FoU tunnels for the given `Egress` networks.
The annotation key is `egress.coil.cybozu.com/NAMESPACE`, where `NAMESPACE` is the namespace of the `Egress`.
The annotation value is a comma-separated list of `Egress` names.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.coil.cybozu.com/internet: egress
    egress.coil.cybozu.com/other-net: egress
An `Egress` has a list of external network addresses. Client pods that want to send packets to these networks should include the `Egress` name in the annotation.
In a client pod, IP policy routing is set up as follows.
# IPv4 link local addresses must be excluded from this feature.
ip rule add to 169.254.0.0/16 pref 1800 table main
# Specific NAT destinations are registered in table 117.
ip rule add pref 1900 table 117
# IPv4 private network addresses.
ip rule add to 192.168.0.0/16 pref 2000 table main
ip rule add to 172.16.0.0/12 pref 2001 table main
ip rule add to 10.0.0.0/8 pref 2002 table main
# SNAT for 0.0.0.0/0 (tun1) must come last.
ip rule add pref 2100 table 118
ip route add default dev tun1 table 118
Users can update the existing NAT setup by editing `spec.destinations` and `spec.fouSourcePortAuto` in the Egress resource.
`coild` watches Egress resources, and when it detects an update, it updates the NAT setup in the pods running on the same node according to the updated Egress.
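For example, such an update could be applied in place like this; the namespace, Egress name, and values are placeholders:

kubectl -n internet-egress patch egresses.coil.cybozu.com internet --type merge \
    -p '{"spec":{"destinations":["0.0.0.0/0"],"fouSourcePortAuto":true}}'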
Currently, Coil does not support updating NAT configuration that the Egress CRD does not include, such as the FoU destination port or the FoU peer (the Service's ClusterIP). Users need to restart NAT client Pods in the following cases:
- The FoU tunnel port is changed using the flags of `coild` and `coil-egress`.
- The Services for router Pods are removed and Kubernetes assigns different ClusterIPs. (NAT clients have to send encapsulated packets to the new peer.)
To understand this section, you need to know about Kubernetes garbage collection and finalizers. If you are not familiar with them, read the Kubernetes documentation on garbage collection and finalizers first.
The owner reference of an `AddressBlock` is set to the `AddressPool` from which the block was allocated.
Therefore, when the owning `AddressPool` is deleted, all `AddressBlocks` carved from the pool are garbage collected automatically by Kubernetes.
That said, an `AddressBlock` should not be deleted until there are no more Pods with an address in the block.
For this purpose, Coil adds a finalizer to each `AddressBlock`. `coild` checks the usage of addresses in the block, and once there are no more Pods using the addresses, it removes the finalizer to delete the `AddressBlock`.
An `AddressBlock` should also be deleted when the `Node` that acquired the block is deleted. Since `coild` running as a DaemonSet pod cannot do this, `coil-controller` watches Node deletions and removes the corresponding `AddressBlocks`. `coil-controller` also periodically checks for dangling `AddressBlocks` and removes them.
`coild` also deletes an `AddressBlock` when it frees the last IP address used in the block. At startup, `coild` also checks each `AddressBlock` for the Node, and if no Pod is using the addresses in the block, it deletes the `AddressBlock`.
Note that Coil does not include `Node` in the list of owner references of an `AddressBlock`. This is because Kubernetes only deletes a resource after all owners in the owner references of the resource are deleted.
Similar to an `AddressBlock` and its addresses, an `AddressPool` should not be deleted until there are no more `AddressBlock`s derived from the pool.
For this purpose, Coil adds a finalizer to each `AddressPool`. `coil-controller` checks the usage of blocks in the pool before removing the finalizer.
Note that `blockOwnerDeletion: true` in an `AddressBlock`'s `ownerReferences` does not always block the deletion of the owning `AddressPool`.
This directive takes effect only when foreground cascading deletion is used.
Also note that a finalizer on an `AddressPool` does not block the garbage collection of `AddressBlock`s.
Normally, `coild` is responsible for deleting the `BlockRequest`s it created.
In case the `Node` where `coild` is running is deleted, Coil adds the node to the `BlockRequest`'s owner references. This way, Kubernetes will collect orphaned `BlockRequest`s.
We will prepare a tool to migrate an existing Coil v1 cluster to v2. Pods run by Coil v1 should survive during and after the v2 migration for a smooth transition.
The following steps illustrate the transition:
1. Remove Coil v1 resources from Kubernetes. This stops new Pod creation and data updates in etcd.
2. Run the converter and save the generated manifests as `data.yaml`. This YAML contains `AddressPools` and `AddressBlocks`. The `AddressBlocks` are marked as reserved.
3. (Optional) Add annotations to namespaces to specify the `AddressPool`.
4. Apply Coil v2 CRDs.
5. Apply `data.yaml`.
6. Apply other Coil v2 resources. This restarts Pod creation.
7. Remove and replace Pods run by Coil v1 one by one.
8. Remove all reserved `AddressBlocks`.
9. Restart all `coild` Pods to resync `AddressBlocks`.
A migration tool called `coil-migrator` helps with steps 1, 2, 3, 7, 8, and 9.
- `coil-migrator dump` does steps 1, 2, and 3.
- `coil-migrator replace` does steps 7, 8, and 9.
Coil v2 will define and use the following custom resources:
- `AddressPool`: An address pool is a set of IP subnets.
- `AddressBlock`: A block of IP addresses carved out of a pool.
- `BlockRequest`: Each node uses this to request the assignment of a new address block.
- `Egress`: Represents an egress gateway for the on-demand NAT feature.
These YAML snippets are intended to hint at the implementation of the Coil CRDs.
apiVersion: coil.cybozu.com/v2
kind: AddressPool
metadata:
  name: pool1
spec:
  blockSizeBits: 5
  subnets:
    - ipv4: 10.2.0.0/16
      ipv6: fd01:0203:0405:0607::/112
apiVersion: coil.cybozu.com/v2
kind: AddressBlock
metadata:
  name: pool1-NNN
  labels:
    coil.cybozu.com/pool: pool1
    coil.cybozu.com/node: node1
  finalizers: ["coil.cybozu.com"]
  ownerReferences:
    - apiVersion: coil.cybozu.com/v2
      controller: true
      blockOwnerDeletion: true
      kind: AddressPool
      name: pool1
      uid: d9607e19-f88f-11e6-a518-42010a800195
index: 16
ipv4: 10.2.2.0/27
ipv6: fd01:0203:0405:0607::0200/123
apiVersion: coil.cybozu.com/v2
kind: BlockRequest
metadata:
  name: <random name>
  ownerReferences:
    - apiVersion: v1
      controller: false
      blockOwnerDeletion: false
      kind: Node
      name: node1
      uid: d9607e19-f88f-11e6-a518-42010a800195
spec:
  nodeName: node1
  poolName: pool1
status:
  addressBlockName: pool1-NNN
  conditions:
    - type: Complete # or Failed
      status: True # or False or Unknown
      reason: "reason of the error"
      message: "a human readable message"
`Egress` generates a Deployment and a Service, so it has fields to customize them.
To support auto-scaling by HPA, it also has some status fields.
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  name: internet
  namespace: internet-egress
spec:
  destinations:
    - 0.0.0.0/0
  replicas: 2
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        foo: bar
      labels:
        name: coil-egress
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  name: coil-egress
              topologyKey: topology.kubernetes.io/zone
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 43200 # 12 hours
status:
  replicas: 1
  selector: "coil.cybozu.com%2Fname=internet"