Questions and answers

The goal of this file is to have a place to quickly commit answers to questions in a form that is easily searchable and can make its way into official documentation later.

Q: When I customize the Ignition generated by the installer, how does the MCO handle that?

This is a supported operation. Today, the MCO does not have good support for "per node" configuration, so configuring things like static IP addresses and partition layouts by customizing the Ignition makes sense.

However, it's important to understand that these custom changes are "invisible" to the MCO today - they won't show up in oc get machineconfig - and hence it's not as straightforward to make "day 2" changes to them.

In the future, it's likely the MCO will gain better support for per-node configuration as well as tools to more easily manipulate Ignition, so there is less need to edit the Ignition JSON directly.
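
As a purely illustrative, hedged sketch of this kind of customization (the directory name, file path, and contents below are examples, not a recommended layout), a file entry can be appended to the installer-generated worker.ign with jq; Ignition will apply it in addition to the config merged in from the Machine Config Server:

```
# Generate the Ignition configs first (directory name is an example).
openshift-install create ignition-configs --dir=install-dir

# Append one custom file entry to worker.ign (mode 420 is decimal for 0644).
jq '.storage.files += [{
      "path": "/etc/example-custom.conf",
      "mode": 420,
      "contents": { "source": "data:,hello%20from%20custom%20ignition" }
    }]' install-dir/worker.ign > worker.ign.new \
  && mv worker.ign.new install-dir/worker.ign
```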

Q: Why are my workers showing older versions of RHCOS?

Today, the MCO only blocks the cluster upgrade on control plane nodes; oc get clusterversion effectively reports the version of the control plane.

To watch rollout of worker nodes, you should look at oc describe machineconfigpool/worker (as well as other custom pools, if any).
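
For example:

```
oc get clusterversion                 # version of the control plane
oc get machineconfigpool              # rollout summary for each pool
oc describe machineconfigpool/worker  # details for the worker pool
```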

Q: How does this relate to Machine API?

There are two fundamental operators in OpenShift 4 that both include "machine" in their name:

The Machine Config Operator (this repository) manages code and configuration "inside" the OS (and specifically targets RHCOS).

The Machine API Operator manages "machine" objects which represent underlying IaaS virtual (or physical) machines.

In other words, they operate at fundamentally different levels, but they do interact. For example, both will currently drain a node: the MCO drains when it is making changes, and the Machine API drains when a machine object that has an associated node is deleted.

Another linkage between the two is booting an instance; in IaaS scenarios the "user data" field (managed by machineAPI) will contain a "pointer Ignition config" that points to the Machine Config Server.
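
If you want to see that linkage on a Machine API managed cluster, one way (the secret name worker-user-data is the common default but may differ per platform or machineset) is to decode the user-data secret; the output is the pointer Ignition config, typically containing an ignition.config.merge entry whose source URL points at the Machine Config Server:

```
oc -n openshift-machine-api get secret worker-user-data \
  -o jsonpath='{.data.userData}' | base64 -d | jq .
```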

However, these repositories are maintained by distinct teams. Also, the Machine API is derived from the upstream Kubernetes "Cluster API" project, whereas the MCO is not.

Q: If I change something manually on the host, will the MCO revert it?

Usually, no. Today, the MCO does not try to claim "exclusive" ownership over everything on the host system; it's just not feasible to do.

If, for example, you write a daemonset that drops a custom systemd unit into e.g. /etc/systemd/system, or do so manually via ssh/oc debug node, OS upgrades will preserve that change (via libostree) and the MCO will not revert it. The MCO/MCD only changes files included in MachineConfigs; there is no code to look for "unknown" files.
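
As a hedged illustration of that "manual change" case (the node and unit names are placeholders), one could write such an unmanaged unit directly on a node:

```
oc debug node/<node-name> -- chroot /host sh -c 'cat > /etc/systemd/system/example-unmanaged.service <<UNIT
[Unit]
Description=Example unit not managed by any MachineConfig
[Service]
ExecStart=/bin/true
[Install]
WantedBy=multi-user.target
UNIT
systemctl daemon-reload'
```

Because no MachineConfig references this file, the MCO will neither revert nor manage it, and OS upgrades preserve it as part of /etc.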

Another case today is that the SDN operator will extract some binaries from its container image and drop them in /opt (which is really /var/opt).

Stated more generally, on an OSTree managed system, all content in /etc and /var is preserved by default across upgrades.

Further, rpm-ostree supports package layering and overrides; these will also (currently) be preserved by the MCO. Note, however, that there is no mechanism today to trigger an MCO-coordinated drain/reboot, which is particularly relevant for rpm-ostree install/override changes.
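
On a node (via ssh, or oc debug node/<node> plus chroot /host), the current deployment along with any layered packages and overrides can be inspected with:

```
rpm-ostree status
```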

If a file that is managed by a MachineConfig is changed out of band, the MCD will detect this and mark the node degraded. We go degraded rather than overwrite in order to avoid reboot loops.
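
A rough sketch for spotting this situation (the namespace and label below are what current MCO releases use; verify against your version):

```
oc get machineconfigpool
oc describe node <node-name> | grep machineconfiguration.openshift.io/
oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-daemon \
  -c machine-config-daemon --tail=50
```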

In the future, we would like to harden this so that host state is more tightly controlled, and ideally avoid having any persistent "unmanaged" state. But it will take significant work to get there, and the status quo means we can support other operators such as SDN (and e.g. nmstate) that may control parts of the host without the MCO's awareness.

Q: How do I debug a node failing to join the cluster?

In clusters that are managed by the Machine API, see this question first.

Broadly speaking, a node can fail to join at two separate stages: inside the initramfs (Ignition), or in the real root. If Ignition fails, debugging currently requires accessing the console of the affected machine. See also this issue.

If the system fails in the real root, and you have configured an SSH key for the cluster, you should be able to ssh to the node. A good first command is systemctl --failed. Important units to look at are machine-config-daemon-firstboot.service and kubelet.service - in general, the problem is likely to be some dependency of kubelet.
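
A rough checklist once you have shell access (the default user on RHCOS is core; the unit names are the ones mentioned above):

```
ssh core@<node-address>
sudo systemctl --failed
sudo systemctl status machine-config-daemon-firstboot.service kubelet.service
sudo journalctl -b -u kubelet.service | tail -n 100
```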

Q: Can I use the MCO to re-partition or re-install?

Not today. The MachineConfig doc discusses which sections of the rendered Ignition can be changed, and that does not include e.g. the Ignition storage section. For example, you cannot currently switch an existing worker node to be encrypted or use RAID after the fact - you must re-provision the system.

The MCO also does not currently support explicitly re-provisioning a system "in place"; however, this is likely to be a future feature. For now, in machineAPI managed environments you should oc delete the corresponding machine object; in UPI installations, cordon and drain the node, then delete the node object and re-provision.
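
A sketch of those two paths (machine and node names are placeholders):

```
# Machine API managed clusters: delete the machine; its MachineSet creates a replacement.
oc -n openshift-machine-api get machines
oc -n openshift-machine-api delete machine <machine-name>

# UPI: cordon/drain, remove the node object, then re-provision the host out of band.
oc adm cordon <node-name>
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
oc delete node <node-name>
```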

A further problem is that the MCO does not make it easy for new nodes to boot in the new configuration.

All this to say: it's quite hard to change the storage layout with the MCO today, but this is a bug.

Q: Why am I getting "error: No enabled repositories" from rpm-ostree?

In OCP 4.13 and earlier, when any extensions are enabled (e.g. kernel-rt, usbguard), the MCO provisions an rpm-md (yum/dnf) repository only while the MCD is making changes to the host system.

This means that any other invocations of rpm-ostree (e.g. a manual rpm-ostree initramfs --enable, or rpm-ostree install etc.) won't have that repository enabled.

In some cases, creating an empty/dummy repository in /etc/yum.repos.d may suffice to work around this bug; alternatively, adding the RHEL UBI repository or full RHEL entitlements may work.
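
A heavily hedged sketch of the empty/dummy-repository idea (paths and names are examples; RHCOS does not ship createrepo_c, so the empty repo metadata would need to be generated in a toolbox container or on another host and copied over):

```
mkdir -p /var/local/empty-repo
createrepo_c /var/local/empty-repo      # run wherever createrepo_c is available

cat > /etc/yum.repos.d/empty.repo <<'EOF'
[empty]
name=Empty workaround repository
baseurl=file:///var/local/empty-repo
enabled=1
gpgcheck=0
EOF

rpm-ostree initramfs --enable           # example invocation that previously failed
```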

In the future, the MCO may change to just leave the repository enabled and avoid this.

However, if you're doing deep changes to the host, it may work significantly better to instead switch to an image layering model.

Q: Does the MCO run on RHEL worker nodes?

Yes, RHEL worker nodes run an instance of the Machine Config Daemon. However, only a subset of MCO functionality is supported there: it is possible to create a MachineConfig that writes files and systemd units to RHEL worker nodes, but it is not possible to manage OS updates, kernel arguments, or extensions on them.
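
As an illustration of what is supported there, here is a minimal sketch of a MachineConfig that writes a single file to the worker pool (the name, path, and contents are examples; the Ignition version shown assumes a recent 4.x release):

```
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-example-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/example-managed.conf
          mode: 0644
          contents:
            source: data:,managed%20by%20MachineConfig
EOF
```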