Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZFS bock mode devices sometimes don't appear in time (or disappear) #1505

Open
mschiff opened this issue Dec 13, 2024 · 2 comments
Open

ZFS bock mode devices sometimes don't appear in time (or disappear) #1505

mschiff opened this issue Dec 13, 2024 · 2 comments
Labels
Incomplete Waiting on more information from reporter

Comments

@mschiff
Copy link

mschiff commented Dec 13, 2024

Required information

  • Distribution: Gentoo
  • Distribution version: --
  • The output of "incus info":
config: {}
api_extensions:
[…]
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB+zCCAYGgAwIBAgIRAICBOYWaBvVjqdDlc8YUw/swCgYIKoZIzj0EAwMwMDEZ
    MBcGA1UEChMQTGludXggQ29udGFpbmVyczETMBEGA1UEAwwKcm9vdEBwbHV0bzAe
    Fw0yNDAxMTIxNTUzNTNaFw0zNDAxMDkxNTUzNTNaMDAxGTAXBgNVBAoTEExpbnV4
    IENvbnRhaW5lcnMxEzARBgNVBAMMCnJvb3RAcGx1dG8wdjAQBgcqhkjOPQIBBgUr
    gQQAIgNiAASHXIMZRZz07Fvdh1ehgSKNc3yk/fYbTPMGbOF3VyeDWVWEsjhHQFeW
    rJ9EPYNjogBUba3NKVaEmMAz+nKDM4TcwwLytIUAFLFyIqaRnuxFHF6RXU/g7jGV
    RzwqYxSzCcujXzBdMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEFBQcD
    ATAMBgNVHRMBAf8EAjAAMCgGA1UdEQQhMB+CBXBsdXRvhwR/AAABhxAAAAAAAAAA
    AAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQD9KzSklLuzuifRoJvKuwCPmn9r
    UyfCac0hEA5Kj57/hgJ0y/+n3ODpln3JuAdn95MCMBJ8b5/W6x6sxC2e+vso+lfD
    LNTgFQ6Ha6FfWEKBTGPb0KPr3XICf5IMIWb6kI3OiA==
    -----END CERTIFICATE-----
  certificate_fingerprint: a75220cf363eb0c0ffd6427f5d9d88510d5d608d335514b76548b47e37bb4e69
  driver: lxc | qemu
  driver_version: 6.0.2 | 8.2.3
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 6.6.62-gentoo-dist
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Gentoo
  os_version: "2.17"
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: pluto
  server_pid: 1857954
  server_version: "6.7"
  storage: zfs
  storage_version: 2.2.5-r0-gentoo
  storage_supported_drivers:
  - name: zfs
    version: 2.2.5-r0-gentoo
    remote: false
  - name: dir
    version: "1"
    remote: false

Issue description

Here is another one on the same setup as in #1483

While running incus-benchmark launch --count 200 --parallel 11 (with --parallel >= 11 on that 12 thread machine), some instances fail with this errors:

(sysadm_r)@pluto ~ # incus-benchmark launch --count 200 --parallel 11
Test environment:
  Server backend: incus
  Server version: 6.7
  Kernel: Linux
  Kernel tecture: x86_64
  Kernel version: 6.6.62-gentoo-dist
  Storage backend: zfs
  Storage version: 2.2.5-r0-gentoo
  Container backend: lxc | qemu
  Container version: 6.0.2 | 8.2.3

Test variables:
  Container count: 200
  Container mode: unprivileged
  Startup mode: normal startup
  Image: images:ubuntu/22.04
  Batches: 18
  Batch size: 11
  Remainder: 2

[Dec 13 09:37:00.962] Found image in local store: 3bf7665e786f9baf780a5774fae5b43f5a29c79d76af1474ff6e8f212fcfa25d
[Dec 13 09:37:00.962] Batch processing start
[Dec 13 09:37:58.761] Processed 11 containers in 57.798s (0.190/s)
[Dec 13 09:38:54.180] Processed 22 containers in 113.217s (0.194/s)
[Dec 13 09:40:51.015] Processed 44 containers in 230.053s (0.191/s)
[Dec 13 09:41:25.084] Failed to launch container 'benchmark-052': Failed creating instance from image: Failed to activate volume: Failed to locate zvol for "zdata/incus/containers/benchmark-052": context deadline exceeded
[Dec 13 09:41:25.099] Failed to launch container 'benchmark-055': Failed creating instance from image: Failed to activate volume: Failed to locate zvol for "zdata/incus/containers/benchmark-055": context deadline exceeded
[Dec 13 09:44:18.373] Processed 88 containers in 437.411s (0.201/s)
[Dec 13 09:44:53.504] Failed to launch container 'benchmark-093': Failed creating instance from image: Failed to activate volume: Failed to locate zvol for "zdata/incus/containers/benchmark-093": context deadline exceeded
[Dec 13 09:45:43.469] Failed to launch container 'benchmark-107': Failed creating instance from image: Failed to mount "/dev/zvol/zdata/incus/containers/benchmark-107" on "/var/lib/incus/storage-pools/default/containers/benchmark-107" using "ext4": no such file or directory
[Dec 13 09:51:44.892] Processed 176 containers in 883.929s (0.199/s)
[Dec 13 09:53:45.504] Batch processing completed in 1004.542s
(sysadm_r)@pluto ~ #

indeed 4 instances missing:

(sysadm_r)@pluto ~ # incus list|grep benchmark| wc -l
196
(sysadm_r)@pluto ~ #

The ZFS storage itself is fine and healthy. The mentioned filesets are missing in the pool. If I increase --parallel further, I get more failing instances (both, "context deadline exceeded" and "no such file or directory")

Steps to reproduce

  1. Have zfs storage backend (here 2x nvme as system and zfs special device and 2x spinning rust for data)
  2. run above incus-benchmark command with --parallel matching (alomost) system cpu threads

Information to attach

I also see following errors, but also for instances that did not fail in the end:

Dez 13 09:43:25 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't open blockdev
Dez 13 09:43:25 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:43:26 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:43:26 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:43:27 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:45:08 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't open blockdev
Dez 13 09:45:09 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:09 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:10 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:10 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:11 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:12 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:12 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:13 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:13 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:14 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:14 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:15 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:15 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:16 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:16 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:17 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:17 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:18 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:18 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:36 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-110: Can't open blockdev
Dez 13 09:52:19 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-183: Can't open blockdev
Dez 13 09:52:34 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-195: Can't open blockdev
Dez 13 09:52:34 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-198: Can't open blockdev
Dez 13 09:52:34 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-195: Can't lookup blockdev
Dez 13 09:52:35 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-198: Can't lookup blockdev
Dez 13 09:52:35 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-195: Can't lookup blockdev
Dez 13 09:52:35 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-198: Can't lookup blockdev

To me this looks like load issues and incusd not waiting long enough for devices to appear? Because it eventually gave up on 107 (does not exist), but not for example on 086 (exists and is running)

Here is some of the the output of incus monitor --pretty, but from another run.

benchmark-055 failed with "no such file or directory":

(sysadm_r)@pluto ~ # grep benchmark-055 incus-monitor.log 
time="2024-12-13T10:35:54+01:00" level=info msg="ID: 7e31f71c-9194-40da-aaa8-eca2192195c2, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:35:54.267475495 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-055]]" Status=Pending StatusCode=Pending UpdatedAt="2024-12-13 10:35:54.267475495 +0100 CET"
time="2024-12-13T10:35:54+01:00" level=info msg="ID: 7e31f71c-9194-40da-aaa8-eca2192195c2, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:35:54.267475495 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-055]]" Status=Running StatusCode=Running UpdatedAt="2024-12-13 10:35:54.267475495 +0100 CET"
time="2024-12-13T10:35:54+01:00" level=debug msg="Instance operation lock created" action=create instance=benchmark-055 project=default reusable=false
time="2024-12-13T10:35:54+01:00" level=info msg="Creating instance" ephemeral=false instance=benchmark-055 instanceType=container project=default
time="2024-12-13T10:35:54+01:00" level=debug msg="Adding device" device=eth0 instance=benchmark-055 instanceType=container project=default type=nic
time="2024-12-13T10:35:54+01:00" level=info msg="Action: instance-created, Source: /1.0/instances/benchmark-055, Requestor: unix/root (@)" location=none storage-pool=default type=container
time="2024-12-13T10:35:54+01:00" level=info msg="Created instance" ephemeral=false instance=benchmark-055 instanceType=container project=default
time="2024-12-13T10:35:54+01:00" level=debug msg="Adding device" device=root instance=benchmark-055 instanceType=container project=default type=disk
time="2024-12-13T10:35:54+01:00" level=debug msg="CreateInstanceFromImage started" driver=zfs instance=benchmark-055 pool=default project=default
time="2024-12-13T10:35:56+01:00" level=debug msg="Activated ZFS volume" dev=zdata/incus/containers/benchmark-055 driver=zfs pool=default volName=benchmark-055
time="2024-12-13T10:36:14+01:00" level=debug msg="Deactivated ZFS volume" dev=zdata/incus/containers/benchmark-055 driver=zfs pool=default volName=benchmark-055
time="2024-12-13T10:36:15+01:00" level=debug msg="Removing device" device=root instance=benchmark-055 instanceType=container project=default type=disk
time="2024-12-13T10:36:15+01:00" level=debug msg="CreateInstanceFromImage finished" driver=zfs instance=benchmark-055 pool=default project=default
time="2024-12-13T10:36:15+01:00" level=debug msg="Instance operation lock finished" action=create err="<nil>" instance=benchmark-055 project=default reusable=false
time="2024-12-13T10:36:15+01:00" level=debug msg="Removing device" device=eth0 instance=benchmark-055 instanceType=container project=default type=nic
time="2024-12-13T10:36:15+01:00" level=info msg="ID: 7e31f71c-9194-40da-aaa8-eca2192195c2, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:35:54.267475495 +0100 CET" Err="Failed creating instance from image: Failed to mount \"/dev/zvol/zdata/incus/containers/benchmark-055\" on \"/var/lib/incus/storage-pools/default/containers/benchmark-055\" using \"ext4\": no such file or directory" Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-055]]" Status=Failure StatusCode=Failure UpdatedAt="2024-12-13 10:35:54.267475495 +0100 CET"
time="2024-12-13T10:36:15+01:00" level=debug msg="Failure for operation" class=task description="Creating instance" err="Failed creating instance from image: Failed to mount \"/dev/zvol/zdata/incus/containers/benchmark-055\" on \"/var/lib/incus/storage-pools/default/containers/benchmark-055\" using \"ext4\": no such file or directory" operation=7e31f71c-9194-40da-aaa8-eca2192195c2 project=default
(sysadm_r)@pluto ~ # 

benchmark-082 failed with context deadline exceeded

(sysadm_r)@pluto ~ # grep benchmark-082 incus-monitor .log
time="2024-12-13T10:38:30+01:00" level=info msg="ID: 69bfdd5e-d71c-4503-a18d-56fac3a0f59c, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:38:30.739073919 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-082]]" Status=Pending StatusCode=Pending UpdatedAt="2024-12-13 10:38:30.739073919 +0100 CET"
time="2024-12-13T10:38:30+01:00" level=info msg="ID: 69bfdd5e-d71c-4503-a18d-56fac3a0f59c, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:38:30.739073919 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-082]]" Status=Running StatusCode=Running UpdatedAt="2024-12-13 10:38:30.739073919 +0100 CET"
time="2024-12-13T10:38:30+01:00" level=debug msg="Instance operation lock created" action=create instance=benchmark-082 project=default reusable=false
time="2024-12-13T10:38:30+01:00" level=info msg="Creating instance" ephemeral=false instance=benchmark-082 instanceType=container project=default
time="2024-12-13T10:38:31+01:00" level=debug msg="Adding device" device=eth0 instance=benchmark-082 instanceType=container project=default type=nic
time="2024-12-13T10:38:31+01:00" level=debug msg="Adding device" device=root instance=benchmark-082 instanceType=container project=default type=disk
time="2024-12-13T10:38:31+01:00" level=info msg="Action: instance-created, Source: /1.0/instances/benchmark-082, Requestor: unix/root (@)" location=none storage-pool=default type=container
time="2024-12-13T10:38:31+01:00" level=info msg="Created instance" ephemeral=false instance=benchmark-082 instanceType=container project=default
time="2024-12-13T10:38:31+01:00" level=debug msg="CreateInstanceFromImage started" driver=zfs instance=benchmark-082 pool=default project=default
time="2024-12-13T10:39:11+01:00" level=debug msg="Removing device" device=eth0 instance=benchmark-082 instanceType=container project=default type=nic
time="2024-12-13T10:39:11+01:00" level=debug msg="CreateInstanceFromImage finished" driver=zfs instance=benchmark-082 pool=default project=default
time="2024-12-13T10:39:11+01:00" level=debug msg="Instance operation lock finished" action=create err="<nil>" instance=benchmark-082 project=default reusable=false
time="2024-12-13T10:39:11+01:00" level=debug msg="Removing device" device=root instance=benchmark-082 instanceType=container project=default type=disk
time="2024-12-13T10:39:11+01:00" level=info msg="ID: 69bfdd5e-d71c-4503-a18d-56fac3a0f59c, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:38:30.739073919 +0100 CET" Err="Failed creating instance from image: Failed to activate volume: Failed to locate zvol for \"zdata/incus/containers/benchmark-082\": context deadline exceeded" Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-082]]" Status=Failure StatusCode=Failure UpdatedAt="2024-12-13 10:38:30.739073919 +0100 CET"
time="2024-12-13T10:39:11+01:00" level=debug msg="Failure for operation" class=task description="Creating instance" err="Failed creating instance from image: Failed to activate volume: Failed to locate zvol for \"zdata/incus/containers/benchmark-082\": context deadline exceeded" operation=69bfdd5e-d71c-4503-a18d-56fac3a0f59c project=default
(sysadm_r)@pluto ~ # 

@mschiff mschiff closed this as completed Dec 13, 2024
@mschiff mschiff reopened this Dec 13, 2024
@stgraber
Copy link
Member

Is your system using udev? Do you also have /lib/udev/zvol_id or zvol_id available in the $PATH?

Normally Incus relies on udev to automatically create the needed /dev/zvol entries. When that doesn't happen, it can also use zvol_id as a fallback mechanism to try to figure out the right device.

There is a 30s timeout for that logic at which point you get the Failed to locate zvol for error.
The other cases would generally imply that we did locate a /dev/zvol device but that it was somehow gone by the time we tried to mount it, or maybe was only partially set up by that point.

I'm running your example on a system also using ZFS block mode here but so far without any problem.

@stgraber stgraber added the Incomplete Waiting on more information from reporter label Dec 13, 2024
@stgraber stgraber changed the title failing instances during incus-benchmark with ZFS storage backend ZFS bock mode devices sometimes don't appear in time (or disappear) Dec 13, 2024
@mschiff
Copy link
Author

mschiff commented Dec 13, 2024

Is your system using udev?

yes

Do you also have /lib/udev/zvol_id or zvol_id available in the $PATH?

yes

Normally Incus relies on udev to automatically create the needed /dev/zvol entries. When that doesn't happen, it can also use zvol_id as a fallback mechanism to try to figure out the right device.

There is a 30s timeout for that logic at which point you get the Failed to locate zvol for error. The other cases would generally imply that we did locate a /dev/zvol device but that it was somehow gone by the time we tried to mount it, or maybe was only partially set up by that point.

I'm running your example on a system also using ZFS block mode here but so far without any problem.

How much system load do you get? I am having a very high IOwait when it happens. Is there any way to rais this 30s timeout just to test if waiting longer would help here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Incomplete Waiting on more information from reporter
Development

No branches or pull requests

2 participants