forked from cloudfoundry/docs-cloudfoundry-concepts
-
Notifications
You must be signed in to change notification settings - Fork 0
/
container-security.html.md.erb
140 lines (109 loc) · 9.39 KB
/
container-security.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: Understanding Container Security
owner: Security
---
<strong><%= modified_date %></strong>
This topic describes how Cloud Foundry (CF) secures the containers that host application instances on Linux. For an overview of other CF security features, see the [Understanding Cloud Foundry Security](security.html) topic.
* [Container Mechanics](#mechanics) provides an overview of container isolation.
* [Inbound and Outbound Traffic from CF](#networking) provides an [overview](#net-overview) of container networking and describes how CF administrators [customize](#config-traffic) container network traffic rules for their deployment.
* [Container Security](#security) describes how CF secures containers by running application instances in [unprivileged](#types) containers and by [hardening](#hardening) them.
## <a id='mechanics'></a>Container Mechanics###
Each instance of an app deployed to CF runs within its own self-contained environment, a [Garden container](./architecture/garden.html). This container isolates processes, memory, and the filesystem using operating system features and the characteristics of the virtual and physical infrastructure where CF is deployed.
CF achieves container isolation by namespacing kernel resources that would otherwise be shared. The intended level of isolation is set to prevent multiple containers that are present on the same host from detecting each other. Every container includes a private root filesystem, which includes a Process ID (PID), namespace, network namespace, and mount namespace.
CF creates container filesystems using the [Garden Rootfs](./architecture/garden.html#garden-rootfs) (GrootFS) tool. It stacks the following using OverlayFS:
* A **read-only base filesystem**: This filesystem has the minimal set of operating system packages and Garden-specific modifications common to all containers.
Containers can share the same read-only base filesystem because all writes are applied to the read-write layer.
* A **container-specific read-write layer**: This layer is unique to each container and its size is limited by XFS project quotas.
The quotas prevent the read-write layer from overflowing into unallocated space.
Resource control is managed using Linux control groups. Associating each container with its own cgroup or job object limits the amount of memory that the container may use. Linux cgroups also require the container to use a fair share of CPU compared to the relative CPU share of other containers.
<p class="note">
<strong>Note</strong>: CF does not support a RedHat Enterprise Linux OS stemcell. This is due to an inherent security issue with the way RedHat handles user namespacing and container isolation.
</p>
## <a id='networking'></a>Inbound and Outbound Traffic from CF
### <a id='net-overview'></a>Networking Overview ###
A host VM has a single IP address.
If you configure the deployment with the cluster on a VLAN, as recommended,
then all traffic goes through the following levels of network address translation, as shown in the diagram below.
* <strong>Inbound</strong> requests flow from the load balancer through the router to the host cell, then into the application container. The router determines which application instance receives each request.
* <strong>Outbound</strong> traffic flows from the application container to the cell, then to the gateway on the cell's virtual network interface. Depending on your IaaS, this gateway may be a NAT to external networks.
<%= image_tag("images/security/sysbound2.png") %>
### <a id='config-traffic'></a>Network Traffic Rules ###
Administrators configure rules to govern container network traffic.
This is how containers send traffic outside of CF and receive traffic from outside, the Internet.
These rules can prevent system access from external networks and between internal components
and determine if apps can establish connections over the virtual network interface.
Administrators configure these rules at two levels:
* Application Security Groups (ASGs) apply network traffic rules at the container level. <%=vars.app_sec_groups_link%>
* Container-to-Container networking policies determine app-to-app communication.
Within CF, apps can communicate directly with each other, but the containers are isolated from outside CF. <%=vars.container_network_link %>
## <a id='security'></a>Container Security###
CF secures containers through the following measures:
* Running application instances in [unprivileged](#types) containers by default
* [Hardening](#hardening) containers by limiting functionality and access rights
* <%=vars.app_sec_groups_default%> <%=vars.app_sec_groups_link%>
### <a id="types"></a> Types
Garden has two container types: unprivileged and privileged. Currently, CF runs all application instances and staging tasks in unprivileged containers by default. This measure increases security by eliminating the threat of root escalation inside the container.
Formerly, CF ran apps based on Docker images in unprivileged containers, and buildpack-based apps and staging tasks in privileged containers. CF ran apps based on Docker images in unprivileged containers because Docker images come with their own root filesystem and user, so CF could not trust the root filesystem and could not assume that the container user process would never be root. CF ran build-pack based apps and staging tasks in privileged containers because they used the cflinuxfs2 root filesystem and all processes were run as the unprivileged user `vcap`.
<% if vars.product_name == 'CF' %>
<%= partial 'config_manifest' %>
<% end %>
### <a id='hardening'></a> Hardening
CF mitigates against container breakout and denial of service attacks in the following ways:
* CF uses the full set of [Linux namespaces](http://manpages.ubuntu.com/manpages/xenial/en/man7/namespaces.7.html) (IPC, Network, Mount, PID, User, UTS) to provide isolation between containers running on the same host. The User namespace is not used for privileged containers.
* In unprivileged containers, CF maps UID/GID 0 (root) inside the container user namespace to a different UID/GID on the host to prevent an app from inheriting UID/GID 0 on the host if it breaks out of the container.
* CF uses the same UID/GID for all containers.
* CF maps all UIDs except UID 0 to themselves. CF maps UID 0 inside the container namespace to `MAX_UID-1` outside of the container namespace.
* Container Root does not grant Host Root permissions.
* CF mounts `/proc` and `/sys` as read-only inside containers.
* CF disallows `dmesg` access for unprivileged users and all users in unprivileged containers.
* CF uses `chroot` when importing docker images from docker registries.
* CF establishes a container-specific overlay filesystem mount. CF uses [`pivot_root`](http://manpages.ubuntu.com/manpages/trusty/man2/pivot_root.2.html) to move the root filesystem into this overlay, in order to isolate the container from the host system’s filesystem.
* CF does not call any binary or script inside the container filesystem, in order to eliminate any dependencies on scripts and binaries inside the root filesystem.
* CF avoids side-loading binaries in the container through bind mounts or other methods. Instead, it re-executes the same binary by reading it from `/proc/self/exe` whenever it needs to run a binary in a container.
* CF establishes a virtual Ethernet pair for each container for network traffic. See the [Container Network Traffic](#config-traffic) section above for more information. The virtual Ethernet pair has the following features:
* One interface in the pair is inside the container’s network namespace, and is the only non-loopback interface accessible inside the container.
* The other interface remains in the host network namespace and is bridged to the container-side interface.
* Egress whitelist rules are applied to these interfaces according to Application Security Groups (ASGs) configured by the administrator.
* First-packet logging rules may also be enabled on TCP whitelist rules.
* DNAT rules are established on the host to enable traffic ingress from the host interface to whitelisted ports on the container-side interface.
* CF applies disk quotas using container-specific XFS quotas with the specified disk-quota capacity.
* CF applies a total memory usage quota through the memory cgroup and destroys the container if the memory usage exceeds the quota.
* CF applies a fair-use limit to CPU usage for processes inside the container through the `cpu.shares` control group.
* CF limits access to devices using cgroups but explicitly whitelists the following safe device nodes:
* `/dev/full`
* `/dev/fuse`
* `/dev/null`
* `/dev/ptmx`
* `/dev/pts/*`
* `/dev/random`
* `/dev/tty`
* `/dev/tty0`
* `/dev/tty1`
* `/dev/urandom`
* `/dev/zero`
* `/dev/tap`
* `/dev/tun`
* CF drops the following [Linux capabilities](http://manpages.ubuntu.com/manpages/xenial/en/man7/capabilities.7.html) for all container processes. Every dropped capability limits the actions the root user can perform.
* `CAP_DAC_READ_SEARCH`
* `CAP_LINUX_IMMUTABLE`
* `CAP_NET_BROADCAST`
* `CAP_NET_ADMIN`
* `CAP_IPC_LOCK`
* `CAP_IPC_OWNER`
* `CAP_SYS_MODULE`
* `CAP_SYS_RAWIO`
* `CAP_SYS_PTRACE`
* `CAP_SYS_PACCT`
* `CAP_SYS_BOOT`
* `CAP_SYS_NICE`
* `CAP_SYS_RESOURCE`
* `CAP_SYS_TIME`
* `CAP_SYS_TTY_CONFIG`
* `CAP_LEASE`
* `CAP_AUDIT_CONTROL`
* `CAP_MAC_OVERRIDE`
* `CAP_MAC_ADMIN`
* `CAP_SYSLOG`
* `CAP_WAKE_ALARM`
* `CAP_BLOCK_SUSPEND`
* `CAP_SYS_ADMIN` (for unprivileged containers)