
rpc error: code = Unavailable desc = transport is closing => Transport Endpoint Not Connected - using Seaweedfs with Nomad #147

Open
Lukas8342 opened this issue Dec 8, 2023 · 9 comments

Comments

@Lukas8342

Hello,

I'm encountering an issue, and I'm unsure whether it stems from SeaweedFS, seaweedfs-csi-driver, or HashiCorp Nomad. I'm reaching out here as a starting point, hoping for guidance, as my troubleshooting options are running thin. In my current setup I have one master, one filer, and one volume server, all running on the same machine with these configurations:

bash

weed master -ip=85.215.193.71 -ip.bind=0.0.0.0 -mdir=/seatest/m -port=9333 -port.grpc=19333
weed volume -mserver=85.215.193.71:9333 -dir=/seatest/d -dataCenter=dc1 -ip=85.215.193.71 -max=30 -ip.bind=0.0.0.0 -port=8080 -port.grpc=18080
weed filer -ip=85.215.193.71 -master=85.215.193.71:9333 -dataCenter=dc1 -rack=rack1

When utilizing the CSI with the following Nomad job:

hcl

job "seaweedfs-plugin" {
  datacenters = ["dc1"]
  type        = "system"

  constraint {
    operator = "distinct_hosts"
    value    = true
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "chrislusf/seaweedfs-csi-driver"
        privileged = true

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--filer=10.7.230.11:8888",
          "--nodeid=${node.unique.name}",
          "--cacheCapacityMB=1000",
          "--cacheDir=/tmp",
        ]
      }

      csi_plugin {
        id        = "seaweedfs"
        type      = "monolith"
        mount_dir = "/csi"
      }
    }
  }
}

It initially appears to work, but when I run jobs with different images, I consistently encounter a "Transport endpoint is not connected" error.

The filer logs display the following when starting a job and mounting it to a volume:

bash

I1208 15:45:26.862162 filer_grpc_server_sub_meta.go:268 => client [email protected]:52516: rpc error: code = Unavailable desc = transport is closing
E1208 15:45:26.862195 filer_grpc_server_sub_meta.go:78 processed to 2023-12-08 15:45:26.861541202 +0000 UTC: rpc error: code = Unavailable desc = transport is closing
I1208 15:45:26.862584 filer_grpc_server_sub_meta.go:312 -  listener [email protected]:52516 clientId -399912238 clientEpoch 2
I1208 15:45:26.862933 filer_grpc_server_sub_meta.go:296 +  listener [email protected]:54900 clientId -1540680978 clientEpoch 2
I1208 15:45:26.862949 filer_grpc_server_sub_meta.go:36  [email protected]:54900 starts to subscribe /buckets/dat from 2023-12-08 15:45:26.862037157 +0000 UTC

Nomad volume mounting is done as follows:

hcl

job "sonatype-nexus" {
  datacenters = ["dc1"]

  group "nexus" {
    count = 1
    network {
      port "http" {
        static = 8081
      }
    }

    volume "vol" {
      type           = "csi"
      read_only      = false
      source         = "nexus-volume"
      access_mode    = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "server" {
      driver = "docker"
      volume_mount {
        volume      = "vol"
        destination = "/nexus-data"
        read_only   = false
      }

      config {
        image = "sonatype/nexus3:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 2000
        memory = 4000
      }
    }
  }
}

I appreciate any insights or guidance you can provide to help resolve this issue.

Thank you.

@worotyns

worotyns commented Jan 13, 2024

Same here on Nomad. The csi-plugin also logs:

panic: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined

The image chrislusf/seaweedfs-csi-driver:v1.1.8 works fine :) It looks like this change caused the problem:
785e69a

@qskousen-membersy

Can confirm that the latest image is broken for me on Nomad, with the error messages referencing Kubernetes, and that using v1.1.8 as @worotyns suggested works.

@chrislusf
Contributor

@duanhongyi please take a look here.

@duanhongyi
Contributor

duanhongyi commented Feb 27, 2024

@chrislusf
It seems to be incompatible with Nomad: the KUBERNETES_SERVICE_HOST environment variable does not exist there.

Let me take a look in the next few days.
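
For context, the panic quoted above comes from client-go's in-cluster configuration loading, which requires exactly those environment variables. A minimal sketch of the kind of guard that would avoid the panic under Nomad (illustrative only, not the driver's actual code):

go

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newInClusterClient returns a Kubernetes client only when the process really
// runs inside a cluster. Under Nomad, rest.InClusterConfig() fails because
// KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are unset, so callers
// can fall back to a default instead of panicking.
func newInClusterClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, fmt.Errorf("not running inside Kubernetes: %w", err)
	}
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newInClusterClient(); err != nil {
		fmt.Println("no in-cluster config, using defaults:", err)
	}
}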

@nahsi

nahsi commented Jun 4, 2024

Still broken in the latest version.

I think this commit completely broke this CSI driver:
785e69a#diff-d7f330f6d6efcabc25613925c10237045948e05bc020c7ecf16c3b331e371e62

@chrislusf
Contributor

Send a PR to revert this change?

@duanhongyi
Contributor

duanhongyi commented Jun 5, 2024

@chrislusf

I think the feature can be degraded gracefully, that is, Nomad CSI volumes simply would not support a capacity limit.

Is this feasible? It is the simplest modification, and I currently do not have a Nomad cluster to experiment with.

The pseudocode is as follows; the key part is the maxVolumeSize fallback:

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NewInCluster and maxVolumeSize come from the driver package.
func GetVolumeCapacity(volumeId string) (int64, error) {
	// Outside Kubernetes (e.g. on Nomad) there is no in-cluster config,
	// so fall back to the configured maximum instead of failing.
	client, err := NewInCluster()
	if err != nil {
		return maxVolumeSize, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Inside Kubernetes, read the capacity from the PersistentVolume spec.
	volume, err := client.CoreV1().PersistentVolumes().Get(ctx, volumeId, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	capacity, _ := volume.Spec.Capacity.Storage().AsInt64()
	return capacity, nil
}
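
With this fallback the driver would no longer panic outside Kubernetes: on Nomad, NewInCluster() fails, so the function reports maxVolumeSize instead of trying to read a PersistentVolume that does not exist.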

@duanhongyi
Contributor

I have looked at Nomad's API and it is not the standard Kubernetes API, so the simplest fix is to skip looking up the capacity of a Nomad volume and directly return the maximum value.

https://developer.hashicorp.com/nomad/api-docs/volumes

If this is feasible, I will submit a PR tomorrow.
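
If it is, the change could be as small as a guard like the following (a rough sketch only; maxVolumeSize here is an illustrative constant standing in for whatever maximum the driver is configured with):

go

package main

import (
	"fmt"
	"os"
)

// maxVolumeSize stands in for the driver's configured maximum (illustrative).
const maxVolumeSize int64 = 1 << 40 // 1 TiB

// getVolumeCapacity skips the PersistentVolume lookup whenever the process is
// not running inside Kubernetes (e.g. under Nomad) and just reports the
// configured maximum.
func getVolumeCapacity(volumeId string) (int64, error) {
	if os.Getenv("KUBERNETES_SERVICE_HOST") == "" {
		return maxVolumeSize, nil // Nomad or plain Docker: no PV to query
	}
	// Inside Kubernetes the real driver would read the capacity from the
	// PersistentVolume spec here.
	return maxVolumeSize, nil
}

func main() {
	capacity, _ := getVolumeCapacity("nexus-volume")
	fmt.Println("reported capacity:", capacity)
}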

@duanhongyi
Contributor

#168
