ceph-csi pool parameter not passed correctly #8550

Closed
ryanmickler opened this issue Jul 28, 2020 · 15 comments
Labels: theme/docs (Documentation issues and enhancements), theme/storage

Comments

ryanmickler (Contributor) commented Jul 28, 2020

Nomad version

0.11.3

Operating system and Environment details

os: ubuntu bionic
ceph-csi: quay.io/cephcsi/cephcsi:v2.1.2-amd64

Issue

The CSI volume parameter pool is apparently not passed to the ceph-csi node plugin during volume mount; I'm receiving "missing required parameter pool".

The problem is potentially related to NodeStageVolume in the CSI spec.

A quick inspection suggests that NodeStageVolume (which mounts the volume to a staging path on the node)
https://github.com/ceph/ceph-csi/blob/47d5b60af8d48574ff6d11ca37dbff5a6f56815b/internal/rbd/nodeserver.go#L116
calls genVolFromVolumeOptions on line 171:
https://github.com/ceph/ceph-csi/blob/47d5b60af8d48574ff6d11ca37dbff5a6f56815b/internal/rbd/nodeserver.go#L171
Inside that function, we hit the "missing required parameter pool" error here:
https://github.com/ceph/ceph-csi/blob/be9e7cf956c378227ff43e0194410468919766b7/internal/rbd/rbd_util.go#L694
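For illustration, a minimal Go sketch (not ceph-csi's actual code; the helper name and map handling here are simplified assumptions) of the kind of check that produces this error when the "pool" key is absent from the key/value map handed to the node plugin:

// Minimal sketch, not ceph-csi's actual code: the node plugin derives its
// rbd options from the key/value map it receives and fails fast when the
// "pool" key is absent. The error text mirrors the log output below.
package main

import (
	"errors"
	"fmt"
)

// volOptionsFromContext stands in for genVolFromVolumeOptions, simplified.
func volOptionsFromContext(volContext map[string]string) (string, error) {
	pool, ok := volContext["pool"]
	if !ok {
		return "", errors.New("missing required parameter pool")
	}
	return pool, nil
}

func main() {
	// Simulates a request whose context carries clusterID but not pool.
	_, err := volOptionsFromContext(map[string]string{"clusterID": "ceph"})
	fmt.Println(err) // missing required parameter pool
}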

Reproduction steps

Job file

csi-test-job.hcl

job "csi-test" {

    update {
        max_parallel = 3
        stagger = "30s"
    }

    region = "${region}"
    datacenters = ["${datacenter}"]
    type = "service"

    group "group" {

        count = 1
        volume "group_csi_volume" {
            type = "csi"
            read_only = false
            source = "csi-test-0"
        }

        task "task" {

            driver = "docker"

            resources {
                cpu    = 500
                memory = 1024
                network {
                    mbits = 1
                }
            }

            volume_mount {
                volume      = "group_csi_volume"
                destination = "/mnt/foo"
                read_only   = false
            }

            config {
                image = "alpine/latest"
                command = "sleep"
                args = [ "infinity" ]
            }
        }
    }
}

Using the Terraform nomad_volume resource (I also tried uploading the raw HCL via the Nomad CLI; no change).
ceph-csi-volume.tf

data "nomad_plugin" "ceph-csi" {
  plugin_id        = "ceph-csi"
  wait_for_healthy = true
}
resource "nomad_volume" "csi_volume" {
  type            = "csi"
  plugin_id       = data.nomad_plugin.ceph-csi.plugin_id
  volume_id       = "..."
  name            = "..."
  external_id     = "..."
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
  parameters = {
    # String representing a Ceph cluster
    clusterID = "..."
    # Ceph pool into which the RBD image shall be created
    pool = var.pool
  }
  secrets = {
    userID = "..."
    userKey = "..."
  }
  context = {}
}

ceph-csi-plugin-controller-job.hcl

...
    group "ceph-csi" {
        count = 1

        task "plugin" {

            driver = "docker"

            resources {
                cpu    = 200
                memory = 500

                network {
                    mbits = 1
                    port "metrics" {}
                }
            }

            # /etc/ceph-csi-config/config.json
            template {
                data = <<CONFG
[
    {
        "clusterID": "ceph",
        "monitors": [
            "..."
        ]
    }
]
CONFG
                destination   = "new/config.json"
                change_mode   = "restart"
            }

            config {
                image = "quay.io/cephcsi/cephcsi:v2.1.2-amd64"

                args = [
                    "--type=rbd",
                    "--controllerserver=true",
                    "--drivername=rbd.csi.ceph.com",
                    "--logtostderr",
                    "--endpoint=unix://csi/csi.sock",
                    "--metricsport=$${NOMAD_PORT_metrics}",
                    "--nodeid=..."
                ]

                # all CSI node plugins will need to run as privileged tasks
                # so they can mount volumes to the host. controller plugins
                # do not need to be privileged.
                privileged = true

                volumes = [
                    "new/config.json:/etc/ceph-csi-config/config.json",
                ]
            }

            service {
                name = "ceph-csi"
                port = "metrics"
                tags = [ "prometheus" ]
            }

            csi_plugin {
                id        = "ceph-csi"
                type      = "controller"
                mount_dir = "/csi" 
            }
        }
    }

ceph-csi-node.hcl

...
type = "system"

   group "ceph" {

        task "plugin" {

            driver = "docker"

            resources {
                cpu    = 200
                memory = 500

                network {
                    mbits = 1
                    port "metrics" {}
                }
            }

            # /etc/ceph-csi-config/config.json
            template {
                data = <<CONFG
[
    {
        "clusterID": "ceph",
        "monitors": [
            "..."
        ]
    }
]
CONFG
                destination   = "new/config.json"
                change_mode   = "restart"
            }

            config {
                image = "quay.io/cephcsi/cephcsi:v2.1.2-amd64"

                args = [
                    "--type=rbd",
                    # Name of the driver
                    "--drivername=rbd.csi.ceph.com",
                    "--logtostderr",
                    "--nodeserver=true",
                    "--endpoint=unix://csi/csi.sock",
                    # Unique ID distinguishing this instance of Ceph CSI among other instances, 
                    # when sharing Ceph clusters across CSI instances for provisioning
                    "--instanceid=...",
                    # This node's ID
                    "--nodeid=...", 
                    # TCP port for liveness metrics requests (/metrics)
                    "--metricsport=$${NOMAD_PORT_metrics}",
                ]

                # all CSI node plugins will need to run as privileged tasks
                # so they can mount volumes to the host. controller plugins
                # do not need to be privileged.
                privileged = true

                volumes = [
                    "new/config.json:/etc/ceph-csi-config/config.json",
                ]
                mounts = [
                    {
                        type = "tmpfs"
                        target = "/tmp/csi/keys"
                        readonly = false
                        tmpfs_options {
                            size = 1000000 # size in bytes
                        }
                    },
                ]
            }

            service {
                name = "ceph-csi"
                port = "metrics"
                tags = [ "prometheus" ]
            }

            csi_plugin {
                id        = "ceph-csi"
                type      = "node"
                mount_dir = "/csi" 
            }
        }
    }

Nomad Client logs

From the ceph-csi node container:
E0728 04:47:20.003465 1 utils.go:163] ID: 23 Req-ID: csi-test-0 GRPC error: rpc error: code = Internal desc = missing required parameter pool

Split off from #8212.
This specific error was first mentioned in #7668 (comment).

ryanmickler changed the title from "ceph-csi pool parameter not passed to NodeStageVolume" to "ceph-csi pool parameter not passed correctly" Jul 28, 2020
ryanmickler (Contributor Author) commented:

@tgross: I tried to include enough of the spec to give a complete picture; let me know if more is required.

tgross (Member) commented Jul 28, 2020

Thanks @ryanmickler. I'll take a look into this later in the week.

ryanmickler (Contributor Author) commented Aug 5, 2020

I believe I've found the problem.

First, here:

type CSIVolume struct {
we can see that both the parameters and context maps are part of the structure. In the HCL, I am passing pool in as a parameter.

Next, here:

req := &csi.NodeStageVolumeRequest{
only the context map is passed through to the NodeStageVolumeRequest as VolumeContext; the parameters block is ignored.

(It's not clear to me what should be a parameter and what should go in context.)

And given the bug you fixed in 901664f, context should now be passed properly into NodeStageVolume.
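For context, a hedged Go sketch using the CSI spec's Go bindings (github.com/container-storage-interface/spec/lib/go/csi); all values are placeholders. The node-side RPC carries only VolumeContext, PublishContext, and Secrets, while the parameters map is a controller-side CreateVolume input, which is why pool needs to ride along in context for a volume registered with Nomad:

// Hedged sketch; values are placeholders, not a real Nomad request.
// Only VolumeContext (plus PublishContext and Secrets) is delivered to the
// node plugin's NodeStageVolume, so keys like "pool" must be present there.
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

func main() {
	req := &csi.NodeStageVolumeRequest{
		VolumeId:          "csi-test-0",
		StagingTargetPath: "/csi/staging/csi-test-0",
		Secrets:           map[string]string{"userID": "...", "userKey": "..."},
		// These are the only free-form key/value pairs the node plugin sees.
		VolumeContext: map[string]string{
			"clusterID": "ceph",
			"pool":      "mypool", // placeholder pool name
		},
	}
	fmt.Println(req.VolumeContext["pool"]) // mypool
}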

This was patched in https://github.com/hashicorp/nomad/releases/tag/v0.12.0-beta2.
So I guess the following should work in Nomad v0.12.1?

data "nomad_plugin" "ceph-csi" {
  plugin_id        = "ceph-csi"
  wait_for_healthy = true
}
resource "nomad_volume" "csi_volume" {
  type            = "csi"
  plugin_id       = data.nomad_plugin.ceph-csi.plugin_id
  volume_id       = "..."
  name            = "..."
  external_id     = "..."
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
  parameters = {}
  secrets = {
    userID = "..."
    userKey = "..."
  }
  context = {
     # String representing a Ceph cluster
    clusterID = "..."
    # Ceph pool in which the RBD image exists/shall be created
    pool = var.pool
  }
}

ryanmickler (Contributor Author) commented:

Update: well, upgrading to Nomad 0.12.1 and passing pool in the context map seems to get past the original error I was having.

Now the issues I'm having are related to how strictly external_id needs to be formatted for the ceph-csi driver. I'll raise these in a new issue if they turn out to be related to Nomad.
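For anyone hitting the same wall, a hedged Go sketch of the volume-handle layout the ceph-csi RBD driver appears to expect (the layout is an assumption based on ceph-csi's internal volume-ID helpers at the time and may differ between releases; all values are placeholders):

// Hedged sketch only: the layout below is an assumption about how ceph-csi
// composes/parses RBD volume handles (encoding version, clusterID length,
// clusterID, pool ID, image UUID) and may not match every release.
package main

import "fmt"

func composeVolumeID(clusterID string, poolID int64, imageUUID string) string {
	// 4 hex chars for the encoding version, 4 for the clusterID length,
	// then the clusterID itself, 16 hex chars for the pool ID, and the UUID.
	return fmt.Sprintf("%04x-%04x-%s-%016x-%s", 1, len(clusterID), clusterID, poolID, imageUUID)
}

func main() {
	// Placeholder values only.
	fmt.Println(composeVolumeID("ceph", 2, "00000000-1111-2222-bbbb-cacacacacaca"))
	// 0001-0004-ceph-0000000000000002-00000000-1111-2222-bbbb-cacacacacaca
}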

I'd be happy to close this, but there's probably some amount of documentation to preserve here as others are likely to hit this error.

tgross (Member) commented Aug 10, 2020

Update: well, upgrading to Nomad 0.12.1 and passing pool in the context map seems to get past the original error I was having.

Darn, I wish I'd dug into this sooner, because then I would have caught the version where we'd fixed that. Sorry about that.

I'd be happy to close this, but there's probably some amount of documentation to preserve here as others are likely to hit this error.

Yeah, we don't currently have a great way of documenting the details of what each CSI plugin expects as inputs, and I'd worry about getting stale relative to what the upstream projects are doing. There's clearly some Nomad-specific documentation to be had that doesn't quite belong in the Nomad docs but doesn't fit anywhere else either.

There's an integrations/ directory that seems like it might be a good fit for example volume specs, or maybe the demos/ directory. We're having a bit of an internal discussion about that.

I'm going to close this issue for now but rest assured we're tracking a discussion about what we can do around documentation improvements.

tgross closed this as completed Aug 10, 2020
tgross added the theme/docs (Documentation issues and enhancements) label and removed the stage/needs-investigation label Aug 10, 2020
RickyGrassmuck (Contributor) commented Aug 10, 2020

@tgross I love the idea of having an examples directory containing CSI deployment specs. If this is something the public can contribute to, I would be happy to send over the OpenStack spec we have, along with any others we may use.

ryanmickler (Contributor Author) commented:

RickyGrassmuck (Contributor) commented:

@ryanmickler yup, that's the one. It's currently broken until 0.12.2 is released but I've tested it with the patch that fixes it and it works great.

ryanmickler (Contributor Author) commented:

Perhaps we could get started on a branch where we put our example configs together?

RickyGrassmuck (Contributor) commented:

Sure, I would be happy to contribute to that!

I have one created for the Cinder CSI driver already, and I did some work getting the iSCSI CSI driver working, which I suspect will work after the 0.12.2 release.

I'd like to hear from @tgross about the internal discussion re: the structure and location of these examples so that we can get this going as smoothly as possible. It's a good amount of work reverse-engineering these CSI drivers to deploy them on Nomad (or any orchestrator not named K8s, lol), so having documented examples of their configs coming in from the community would be fantastic.

ryanmickler (Contributor Author) commented Aug 12, 2020

Absolutely. Let's wait for feedback.

P.S. I also use Nomad on OpenStack, but I've been needing to use CephFS, which I don't think Cinder can support.

For that, we'll need the Manila CSI plugin (https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/using-manila-csi-plugin.md), which I assume your config will help with.

tgross (Member) commented Aug 12, 2020

Thanks folks! I've opened #8651 to add a directory at ./demo/csi/ to collect these.

RickyGrassmuck (Contributor) commented:

@tgross Awesome, appreciate it!

@ryanmickler I just opened #8662, adding an example for the Cinder CSI driver, if you want to take a look at it.

ryanmickler (Contributor Author) commented:

Great, I just added #8664.

This should get people started.

github-actions bot commented Nov 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022