Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce DCN DT #392

Merged

Conversation

sbekkerm
Copy link
Contributor

@sbekkerm sbekkerm commented Sep 9, 2024

This PR introduces DCN VA, which builds upon the HCI VA architecture and is designed for multi-site deployment.

In addition to the regular configuration files, this PR includes Jinja templates for generating values.yaml and service-values.yaml files. These templates are essential for Zuul job execution, allowing for the creation of site-specific configuration files multiple times for each DCN site.

Copy link

openshift-ci bot commented Sep 9, 2024

Hi @sbekkerm. Thanks for your PR.

I'm waiting for a openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fultonj fultonj self-requested a review September 11, 2024 12:15
examples/va/dcn/README.md Outdated Show resolved Hide resolved
automation/vars/dcn.yaml Outdated Show resolved Hide resolved
@fultonj
Copy link
Contributor

fultonj commented Sep 16, 2024

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:

https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages

It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).

https://github.com/fultonj/dcn?tab=readme-ov-file#steps

No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

@sbekkerm
Copy link
Contributor Author

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:

https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages

It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).

https://github.com/fultonj/dcn?tab=readme-ov-file#steps

No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

  1. It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
    Additionally, it mentions that Steps 3, 4, and the Ceph installation need to be executed for each DCN site

The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18

and the post-ceph service-values.yaml contains Glance Multi Store configuration:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

  1. Why should we use DT instead of VA?

@fultonj
Copy link
Contributor

fultonj commented Sep 16, 2024

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:
https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages
It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).
https://github.com/fultonj/dcn?tab=readme-ov-file#steps
No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

  1. It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
    Additionally, it mentions that Steps 3, 4, and the Ceph installation need to be executed for each DCN site

The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18

and the post-ceph service-values.yaml contains Glance Multi Store configuration: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

Yes, I understand but I still need clear written instructions so I (or anyone else) can reproduce.

I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much and filling them in will help other engineers and the docs team. From a very high level it's so we can have https://en.wikipedia.org/wiki/Reproducibility

  1. Why DT?

https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md

I don't think this should be an update to an existing DT, it should be a new DT, but it's a DT, not a VA.

I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI but not yet ready to be a full blown VA we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have I think it should be a DT.

@sbekkerm
Copy link
Contributor Author

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:
https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages
It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).
https://github.com/fultonj/dcn?tab=readme-ov-file#steps
No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

  1. It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
    Additionally, it mentions that Steps 3, 4, and the Ceph installation need to be executed for each DCN site

The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18
and the post-ceph service-values.yaml contains Glance Multi Store configuration: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

Yes, I understand but I still need clear written instructions so I (or anyone else) can reproduce.

I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much and filling them in will help other engineers and the docs team. From a very high level it's so we can have https://en.wikipedia.org/wiki/Reproducibility

  1. Why DT?

https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md

I don't think this should be an update to an existing DT, it should be a new DT, but it's a DT, not a VA.

I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI but not yet ready to be a full blown VA we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have I think it should be a DT.

The README contains all the steps to reproduce the environment. Could you please clarify what specifically is unclear?

  1. These steps for deploying the control plane:
    https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane.md
  2. These steps for preparing nodes for ceph installation:
    https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-pre-ceph.md
  3. These steps for configuring nodes after ceph installation:
    https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md

The Jinja templates are not involved in the deployment process, as they are only used by CI to set the "CHANGEME" parameters in values.yaml and service-valiues.yaml

@fultonj
Copy link
Contributor

fultonj commented Sep 16, 2024

@sbekkerm

Please move this from the va directory to the dt directory.

I am putting it in my backlog to go through your README line by line and attempt to reproduce what you have deployed and I'll ask you clarifying questions along the way which will point out what is incomplete in the READMEs.

@sbekkerm
Copy link
Contributor Author

@fultonj

Moved from va to dt directory as requested.
Also added a high-level diagram to the README for more clarity. Let me know if you have any questions.

@sbekkerm sbekkerm changed the title Introduce DCN VA Introduce DCN DT Sep 17, 2024
Copy link
Contributor

@krcmarik krcmarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am suggesting some changes based on what we had in 17.1 openstack services configs and/or to make tempest tests pass (I've managed to make compute and volume tempest suites to pass all tests with the proposed changes which I applied manually)

examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Show resolved Hide resolved
examples/dt/dcn/values.yaml.j2 Outdated Show resolved Hide resolved
examples/dt/dcn/values.yaml.j2 Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
@fultonj
Copy link
Contributor

fultonj commented Oct 1, 2024

@sbekkerm Could this be rebased?

On Thursday Oct 3rd I'll do a deployment to test this proposed patch and leave feedback (sorry for the delay).

Copy link
Contributor

@krcmarik krcmarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more suggestion for the CR

examples/dt/dcn/service-values.yaml Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
@krcmarik
Copy link
Contributor

krcmarik commented Oct 3, 2024

@sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.

@sbekkerm
Copy link
Contributor Author

sbekkerm commented Oct 3, 2024

@sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.

In my view, it's a great suggestion. This feature will be beneficial in cases where AZ sites contain many compute nodes. Additionally, we can use this when we spread the control plane across AZs.

@fultonj
Copy link
Contributor

fultonj commented Oct 3, 2024

@sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.

In my view, it's a great suggestion. This feature will be beneficial in cases where AZ sites contain many compute nodes. Additionally, we can use this when we spread the control plane across AZs.

I agree.

My understanding is that there is agreement and that we can go ahead with using multicell in this DT.

That said there might be people who don't use multicell but I still think we should test this with multicell. If there was a way to run the resultant job with our without multicell by flipping one switch that would be really nice but that might be too much to ask in this first edition.

@fultonj
Copy link
Contributor

fultonj commented Oct 3, 2024

@sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.

@krcmarik do you want to try that in a stacked patch.

i.e. create a new PR which has been rebased on this one so that we can merge this one sooner.

examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml.j2 Outdated Show resolved Hide resolved
@hjensas
Copy link
Contributor

hjensas commented Oct 8, 2024

I am not a fan of the introduction of jinja2 templates - and the custom playbook to execute them.
Can we not instead define multiple nodesets, and use the "standard" kustomize_deploy role to apply this?
There are atleast two other PR's proposed that use multiple nodeset approach.

@sbekkerm
Copy link
Contributor Author

sbekkerm commented Oct 9, 2024

I am not a fan of the introduction of jinja2 templates - and the custom playbook to execute them. Can we not instead define multiple nodesets, and use the "standard" kustomize_deploy role to apply this? There are atleast two other PR's proposed that use multiple nodeset approach.

@hjensas
I get what you mean about preferring to use multiple nodesets "standard" role, but the DCN setup is a bit more complex than that.

It’s not just about handling multiple nodesets with spine and leaf—it’s about managing multiple sites, each with its own configuration for services like Glance and Cinder. Plus, each site needs its own Ceph cluster, which adds another layer of complexity. That’s why we went with Jinja2 templates and a custom playbook for this.

But I do agree with you that eventually integrating this into the ci-framework would be ideal

examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/service-values.yaml Outdated Show resolved Hide resolved
examples/dt/dcn/README.md Outdated Show resolved Hide resolved
@fultonj
Copy link
Contributor

fultonj commented Oct 22, 2024

@sbekkerm I see two changes needed up front.

  1. Written Instructions
  2. VA vs DT

These changes have been addressesd.

rbd_cluster_name = az2
backend_availability_zone = az2
glance:
customServiceConfig: |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This customServiceConfig at glance level seems redundant to me.
Given you're defining the backend for each API under glanceAPIs (including az0), you can basically remove from L69 to L79 [1].
You inherit the top-level customServiceConfig only if it's not defined for that specific instance, and this does not seem the case.
@fultonj do you want to update the glance config?

[1] https://github.com/openstack-k8s-operators/glance-operator/blob/main/controllers/glance_controller.go#L890

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it doesn't block your deployment because as per the link I mentioned earlier the right config ends up being on the right instance, but I think this is simply redundant and should be patched.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but per our discussion I will submit a follow up patch to address this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I'm ok to move forward with this.

@fmount fmount self-requested a review October 23, 2024 10:33
Copy link
Contributor

@fultonj fultonj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +2 on merging this.

Copy link
Contributor

@fmount fmount left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Oct 24, 2024
Copy link
Contributor

@abays abays left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Copy link
Contributor

@fultonj fultonj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

Copy link

openshift-ci bot commented Oct 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays, fmount, fultonj, sbekkerm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Build succeeded (gate pipeline).
https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d2ccfc4d99f44bfe825442753996e878

✔️ noop SUCCESS in 0s
✔️ rhoso-architecture-validate-dcn SUCCESS in 3m 28s

@softwarefactory-project-zuul softwarefactory-project-zuul bot merged commit 2b53bd2 into openstack-k8s-operators:main Oct 24, 2024
9 checks passed
@fultonj
Copy link
Contributor

fultonj commented Oct 25, 2024

/cherrypick 18.0-fr1

@openshift-cherrypick-robot

@fultonj: new pull request created: #426

In response to this:

/cherrypick 18.0-fr1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fultonj
Copy link
Contributor

fultonj commented Oct 25, 2024

@sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.

@krcmarik do you want to try that in a stacked patch.
i.e. create a new PR which has been rebased on this one so that we can merge this one sooner.

We are using cells now. We can do follow ups for further changes.

To clarify, we have cells, but we're not using cells per site. I think we should address that in a follow up patch.

@sbekkerm sbekkerm deleted the dcn branch October 25, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants