Template rendering through nomad job failing on windows nodes #20034

Closed
pavanrangain opened this issue Feb 26, 2024 · 18 comments

@pavanrangain

pavanrangain commented Feb 26, 2024

Nomad version

1.6.7

Operating system and Environment details

Windows Server 2019

Issue

Template rendering in a Nomad job fails on Windows nodes since 1.6.7 (the issue is also seen in 1.6.8). The issue was not present in version 1.6.6.

Reproduction steps

  1. Run a Nomad job with a template block that renders to a file, either reading from Consul KV or using a default from the file system

Expected Result

Job should get deployed successfully

Actual Result

The job fails with an error like the following:

Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142

NOTE: the actual path has been removed from the error.

Job file (if appropriate)

job "test-windoes-job" {
  region      = "global"
  datacenters = ["my-test-cluster"]
  type        = "system"

  group "test-winodows-group" {
    count = 1

    constraint {
      attribute = "${attr.kernel.name}"
      value     = "windows"
    }
  
    task "test-windows-task" {
      driver = "raw_exec"

      artifact {
        source = "<some artifactory url>/filebeat.exe"
      }

      env {
        LOGZIO_CODEC          = "json"
        IP_ADDRESS            = "${attr.unique.network.ip-address}"
        HOSTNAME              = "${attr.unique.hostname}"
        CLUSTER_NAME          = "my-test-cluster"     
      }

      config {
        command = "filebeat.exe"
        args    = ["-c", "local/log_config"]
      }

      template {
        data = <<EOH
          ############################# Filebeat #####################################

          filebeat.inputs:

          - type: log
            paths:
              - ${NOMAD_ALLOC_DIR}/logs/*
            fields:
              logzio_codec: ${LOGZIO_CODEC:'json'}
              token: ${LOGZIO_TOKEN:''}
              clusterName: ${CLUSTER_NAME:''}
            fields_under_root: true
            encoding: utf-8
            ignore_older: 24h
            tail_files: true
            exclude_lines: ${excludeLines:[]}

          #The following processors are to ensure compatibility with version 7
          processors:
          - rename:
              fields:
              - from: "agent"
                to: "beat_agent"
              ignore_missing: true
          - rename:
              fields:
              - from: "log.file.path"
                to: "source"
              ignore_missing: true

          ############################# Output ##########################################
          output:
            logstash:
              enabled: false
              hosts: ["listener.logz.io:5015}"]
        EOH

        destination = "local/log_config"
        change_mode = "restart"
      }

      resources {
        cpu    = 100 # Mhz
        memory = 100 # MB
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nothing relevant

Nomad Client logs (if appropriate)

The client logs just show the same error:
Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142

Observation:

The issue may be caused by this change that went into 1.6.7. Template rendering works on Linux nodes; the issue occurs only on Windows nodes.

@meowtini

I'm experiencing a similar problem. Template rendering fails on the following template:

template {
  data        = <<EOH
    USERNAME="{{ with secret "secret/path" }}{{ .Data.data.username }}{{ end }}"
    PASSWORD="{{ with secret "secret/path" }}{{ .Data.data.password }}{{ end }}"
  EOH
  destination = "secrets/vault.cred"
  env         = true
}

I didn't have any issues with this template on 1.7.1, but after upgrading to 1.7.5 I started to get the template render subprocess failed: exit status 0xc0000142 error.

@hardselius

I've experienced the same issue in both 1.7.5 and 1.7.4. 1.7.2 is working.

@jrasell
Member

jrasell commented Feb 27, 2024

Hi @meowtini and @hardselius; thanks for raising and contributing to this issue. I believe this is caused by the changes introduced in #19888, so I would ask for some additional information to help us understand the problem that our testing missed.

  • client logs from when the task is placed until after the template render fails
  • details of the Nomad client binary permissions
  • details on the user that is being used to run the Nomad client binary
  • permission details on the template destination as well as the parent path

Thanks.

@pavanrangain
Author

@jrasell Please find the info you asked for:

  • client logs from when the task is placed until after the template render fails
    nomad-windows-agent-logs.txt

  • details of the Nomad client binary permissions
    nomad_file_permission

  • details on the user that is being used to run the Nomad client binary
    It's the Windows SYSTEM user

  • permission details on the template destination as well as the parent path
    The SYSTEM user has full permission
    nomad_folder_permissions

Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).

@tgross
Member

tgross commented Mar 1, 2024

Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).

Yeah, the security update in 1.6.7 has a significantly different implementation on Windows than on any other operating system. We had to implement AppContainers rather than just chrooting the rendering subprocess.

Unfortunately, it looks like the client logs you've provided here are at info-level only, so we may be missing some context. Here are the only relevant bits:

{"@level":"info","@message":"(runner) creating new runner (dry: false, once: false)","@module":"agent","@timestamp":"2024-02-28T04:59:42.152055-05:00"}
{"@level":"info","@message":"(runner) creating watcher","@module":"agent","@timestamp":"2024-02-28T04:59:42.153260-05:00"}
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}
{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-02-28T04:59:42.204358-05:00","alloc_id":"de978c15-f7fe-ec5a-c386-a205a79ec2d5","failed":true,"msg":"Template failed: error rendering \"(dynamic)\" =\u003e \"C:\\\\ProgramData\\\\nomad\\\\alloc\\\\de978c15-f7fe-ec5a-c386-a205a79ec2d5\\\\control-plane-logging-task\\\\local\\\\log_config\": template render subprocess failed: exit status 0xc0000142","task":"control-plane-logging-task","type":"Killing"}

According to the MSFT error reference documentation (PDF), the exit code we're getting here is STATUS_DLL_INIT_FAILED, which implies we weren't able to start the rendering process because we couldn't open a DLL somewhere. This doesn't make a whole lot of sense, as the rendering process doesn't load any external DLLs. So I'm still investigating how we could be hitting this error.
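For readers less familiar with Windows exit codes: 0xc0000142 follows the NTSTATUS bit layout (a 2-bit severity, a customer flag, a reserved bit, a 12-bit facility, and a 16-bit code). The following is a minimal Python sketch that unpacks those fields. The names `decode_ntstatus` and `SEVERITY_NAMES` are hypothetical and exist only for illustration; they are not part of Nomad or of any Windows API.

```python
# Sketch only: split a 32-bit NTSTATUS value into its bit fields.
# decode_ntstatus and SEVERITY_NAMES are illustrative names, not a real API.
SEVERITY_NAMES = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(value: int) -> dict:
    """Return the severity, customer flag, facility, and code of an NTSTATUS."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return {
        "severity": SEVERITY_NAMES[(value >> 30) & 0x3],  # bits 31-30
        "customer_defined": bool((value >> 29) & 0x1),    # bit 29
        "facility": (value >> 16) & 0xFFF,                # bits 27-16
        "code": value & 0xFFFF,                           # bits 15-0
    }


# The exit status reported in the client logs above.
fields = decode_ntstatus(0xC0000142)
print(fields)
```

Decoding the value gives an error-severity status with the customer flag unset, facility 0, and code 0x142, which is what you would expect from a system-defined status such as STATUS_DLL_INIT_FAILED rather than an exit code chosen by the rendering subprocess itself.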

It shouldn't make a difference, but just to help me eliminate possibilities, are you running Nomad as a Windows Service using the instructions in Register Nomad with Windows, or some other way?

@pavanrangain
Author

trace_log_windows_client.json
Please find the trace logs. Nomad is registered as a native Windows service.
nomad service properties

@tgross
Member

tgross commented Mar 15, 2024

Thanks @pavanrangain! Sorry to get back so slowly on this. I just wanted to pop in and say we haven't forgotten you but I've been swamped with a couple of other items. We're going to hand this issue off to @angrycub who helped me work on the security fix that's at the heart of this, and he'll start looking at this once he's wrapped up his current task. Thanks for your patience!

@DTTerastar

Any progress on this? We're still stuck on 1.7.3 in order to avoid this issue.

@tgross
Member

tgross commented Apr 16, 2024

Hi @DTTerastar! We have a solid idea of the problem, which is a difference in ambient credentials between running Nomad as a Windows service vs otherwise (which is what all our tests did! 🤦). It's taking longer than we'd like to figure out the solution however. We'll update this issue when we have more information.

@mikedvinci90

I am having the same issue on Ubuntu for many jobs.

Template failed: error rendering "(dynamic)" => "/etc/nomad.d/data/alloc/44a00841-7dc1-bef9-f3a4-98f456be2d8f/a1beb0eb-1d27-41b6-9324-3a5f15642a25/local/.env": template render subprocess failed: signal: killed

@tgross
Member

tgross commented May 6, 2024

@mikedvinci90 if your issue isn't on Windows, please open a new issue for that. The isolation mechanism is very different between the two OS. Debugging this is likely possible on Linux without the patch we're working on (slowly!) for Windows.

@thfai2000

thfai2000 commented May 12, 2024

Hi @tgross. I'm facing a similar issue.
I'd really appreciate it if you can help. Thanks.

Nomad version

1.7.7

Operating system and Environment details

Windows 11 Home
OS build: 22621.3447

Issue:

Template rendering works fine when the Nomad binary is run from PowerShell (Administrator mode), but it fails when Nomad runs as a Windows service.

In the web UI, I saw this error message when running Nomad as a Windows service:
"Task hook failed: template: failed to read template: exit status 0xc0000142"
nomad.log

But actually the template file is there.
image

Reproduction steps

Use "sc.exe create ..." to create the Windows service, with "Local System" as the running user.
Then run a simple job with a template block.

My server configuration

datacenter = "dc0"
name = "nomad-on-win11"

data_dir  = "D:\\hashicorp\\nomad\\data"
log_file = "D:\\hashicorp\\nomad\\log\\nomad.log"
log_level = "DEBUG"
bind_addr = "0.0.0.0"

server {
  # license_path is required for Nomad Enterprise as of Nomad v1.1.1+
  #license_path = "/etc/nomad.d/license.hclic"
  enabled          = true
  bootstrap_expect = 1

  # This is the IP address of the first server provisioned
  server_join {
    # nslookup "$(hostname).local"
    retry_join = ["127.0.0.1:4648"]
    retry_max = 3
    retry_interval = "15s"
  }
}

client {
  enabled = true
  servers = ["127.0.0.1"]
  # use command to find the interface name "netsh int ipv4 show interfaces"
  network_interface = "Loopback Pseudo-Interface 1"
}

plugin "raw_exec" {
    config {
      enabled = true
    }
}

My Job

job "fo-component" {


  group "example" {

    task "service-task" {
      artifact {
        source      = "https://github.com/thfai2000/jenkins-pipelines/releases/download/1.0/artifact-1.0.zip"
        destination = "local/app"
      }

      template {
        source        = "local/app/config.xml.tpl"
        destination   = "local/app/config.xml"
      }


      driver = "raw_exec"
      config {
        command = "local/app/bin/Release/net8.0/win-x64/.net.exe"
      }

    }

  }

}

The Windows Service Properties

image
The executable:
D:\hashicorp\nomad\bin\nomad.exe agent -config=D:\hashicorp\nomad\config\nomad.hcl
image

@tgross
Member

tgross commented May 13, 2024

@thfai2000 for now the only solution is to disable the file sandbox: https://developer.hashicorp.com/nomad/docs/configuration/client#disable_file_sandbox This sounds much worse than it really is, as you're already using raw_exec and the task itself can bypass the sandbox. We're still working on trying to figure out a better long term solution, including engaging with our partners at the OS vendor.

@thfai2000

Hi @tgross,

Thanks for your advice. It works now after I set "disable_file_sandbox = true" in the template section of the client block in my configuration file.
Good to hear that your team is trying to figure out a better solution. I really appreciate your team's effort. Thanks.

client {
  enabled = true
  template {
    disable_file_sandbox = true
  }
  servers = ["127.0.0.1"]
  # use command to find the interface name "netsh int ipv4 show interfaces"
  network_interface = "Loopback Pseudo-Interface 1"
}

plugin "raw_exec" {
    config {
      enabled = true
    }
}

@tgross
Member

tgross commented May 17, 2024

Not the same issue but deeply interrelated: #20585

@gscho

gscho commented May 18, 2024

Disabling the file sandbox also worked for us, but we would like to see a proper fix for this.

@pkazmierczak
Contributor

Hi @pavanrangain, we just merged 2 changes that will remedy this problem. Nomad 1.8.2 will no longer sandbox template rendering on Windows, and to address the security aspect (which is only relevant for running Docker with Process Isolation as ContainerAdmin) it will perform checks in the Docker driver. I will close the issue for now, feel free to re-open if the problem persists.


I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2024