Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drtprod: support for remote execution #140016

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nameisbhaskar
Copy link
Contributor

@nameisbhaskar nameisbhaskar commented Jan 29, 2025

currently drtprod execute runs the execution of the YAML file locally. This is an issue in cases where we have long running deployments. So, to solve this problem, this change provides a new option to run the execution on a remote machine. This machine is created and all the dependent files are copied across to that machine. All executions are done in the remote machines.
In addition, I have added another feature to define the dependent scripts. This becomes more important as the impact of failure in remote execution is more as compared to running locally. So, we have to ensure that the required files are present.

Epic: None
Release Note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nameisbhaskar nameisbhaskar force-pushed the remote-drtprod branch 8 times, most recently from b7b9257 to a9df52a Compare January 30, 2025 11:42
@nameisbhaskar nameisbhaskar marked this pull request as ready for review January 31, 2025 07:06
@nameisbhaskar nameisbhaskar requested a review from a team as a code owner January 31, 2025 07:06
@nameisbhaskar nameisbhaskar requested review from sambhav-jain-16 and shailendra-patel and removed request for a team January 31, 2025 07:06
currently drtprod execute runs the execution of the YAML file locally. This is an issue in cases where we have long running deployments. So, to solve this problem, this change provides a new option to run the execution on a remote machine. This machine is created and all the dependent files are copied across to that machine. All executions are done in the remote machines.
In addition, I have added another feature to define the dependent scripts. This becomes more important as the impact of failure in remote execution is more as compared to running locally. So, we have to ensure that the required files are present.

Epic: None
Release Note: None
@shailendra-patel
Copy link
Contributor

My understanding of the current implementation is as follows. Please correct me if there are any gaps:

We are creating a YAML-like configuration for the remote executor in the code, then executing it to deploy the remote machine. After that, we copy all the relevant files to the remote machine. Finally, we create a systemd service to run the actual execution of drtprod for cluster provisioning using the following command:
drtprod execute pkg/cmd/drtprod/configs/test.yaml --useRemoteDeployer=true

This approach adds a lot of unnecessary code, such as preparing the configuration for the remote deployer in the code, which is hardcoded and reduces the flexibility of choosing another cloud, machine type, etc. Additionally, the added complexity makes it harder to reason about the code.

We can simplify the usage by leveraging existing code with some enhancements in the YAML configuration file. We can have options like is_remote_execution and yamlFile. If these options are present in the YAML file, we simply need to do the following:

  1. Create a remote machine.
  2. Copy the YAML files.
  3. Copy the dependent target files to the remote machine.
  4. Start the systemd service, which executes the command for cluster provisioning.

Sample YAML file for this approach:

# YAML for creating and configuring the drt-chaos and workload-chaos clusters. This also configures Datadog.
environment:
  ROACHPROD_GCE_DEFAULT_SERVICE_ACCOUNT: [email protected]
  ROACHPROD_DNS: drt.crdb.io
  ROACHPROD_GCE_DNS_DOMAIN: drt.crdb.io
  ROACHPROD_GCE_DNS_ZONE: drt
  ROACHPROD_GCE_DEFAULT_PROJECT: cockroach-drt
  CLUSTER: drt-cluster-monitor

targets:
  - target_name: $CLUSTER
    is_remote_execution: true
    yamlFile: pkg/cmd/drtprod/configs/drt_chaos.yaml
    steps:
      - command: create
        args:
          - $CLUSTER
        flags:
          clouds: gce
          gce-zones: "us-central1-a"
          nodes: 1
          gce-machine-type: n2-standard-2
          username: drt
          lifetime: 8760h
          gce-image: "ubuntu-2204-jammy-v20240319"
        on_rollback:
          - command: destroy
            args:
              - $CLUSTER
      - command: sync
        flags:
          clouds: gce

dependent_file_locations:
  - artifacts/roachprod
  - artifacts/roachtest
  - pkg/cmd/drtprod/scripts/setup_dmsetup_disk_staller
  - pkg/cmd/drtprod/scripts/setup_datadog_cluster
  - pkg/cmd/drtprod/scripts/setup_datadog_workload
  - pkg/cmd/drtprod/scripts/tpcc_init.sh
  - pkg/cmd/drtprod/scripts/generate_tpcc_run.sh
  - pkg/cmd/drtprod/scripts/generate_kv_run.sh
  - pkg/cmd/drtprod/scripts/generate_tpcc_drop.sh

cobraCmd.Flags().BoolVarP(&useRemoteDeployer,
"remote", "r", false, "runs the deployment remotely in the deployment node. This is "+
"useful for long running deployments as the local execution completes soon. A \"deployment_name\" must be mentioned "+
"in the YAML for running a deployment remotely.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we have a more precise description of the flag something like below
This flag enables the deployment of the drtprod cluster from a remote machine. It is useful for long-running deployments, as the local execution completes quickly. Ensure that a "deployment_name" is specified in the YAML file to enable remote deployment.


// GetYamlProcessor creates a new Cobra command for processing a YAML file.
// The command expects a YAML file as an argument and runs the commands defined in it.
func GetYamlProcessor(ctx context.Context) *cobra.Command {
displayOnly := false
useRemoteDeployer := false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename variable.
useRemoteDeployment

if err != nil {
// drtprod is needed in the path to run yaml commands
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we wrap the error to give more context to the user on failure.

return errors.Wrap(err, "drtprod executable is required in the path to run yaml commands")

Targets []target `yaml:"targets"` // List of target clusters with their steps
DependentFileLocations []string `yaml:"dependent_file_locations,omitempty"` // location of all the dependent files - scripts/binaries
DeploymentName string `yaml:"deployment_name,omitempty"` // name provided to the deployment
Environment map[string]string `yaml:"environment"` // Environment variables to set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more context around DeploymentName in comment.

// either locally or in a remote VM based on the flag "useRemoteDeployer".
// If the useRemoteDeployer is set to true, a new VM is created and the current YAML is copied across along
// with the required scripts and binaries and execution is run there. This is useful for long-running deployments
// where the local deployment just delegates the execution of the deployment to the remote machine.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: throughout can we follow this best practice for code commenting.

Use full sentences. Capital letter at begin, period at the end, with at least a subject and a verb.

if _, err := os.Stat(drtprodLocation); err != nil {
// if drtprod binary is not available in artifacts, we cannot proceed as this is needed for executing
// the YAML remotely
return fmt.Errorf("%s must be available for executing remotely", drtprodLocation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use erros.Wrapf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants