Cannot run etcd as a Windows service #10433

Closed
jasper-d opened this issue Jan 25, 2019 · 24 comments

Comments

@jasper-d

Repro:

  1. Extract etcd binaries to C:\etcd
  2. mkdir C:\etcd\data
  3. Grant "Full Access" (rwx) to "NT AUTHORITY\Local Service" on C:\etcd
  4. Start an elevated command prompt
  5. Install the service: sc create etcd binpath= "C:\etcd\etcd.exe --data-dir C:\etcd\data" obj= "NT AUTHORITY\Local Service"
  6. Start the service: net start etcd

Expected:

etcd service starts

Actual:

  • etcd service start times out

  • Windows event log shows two errors:

    • Service Control Manager: "A timeout was reached (120000 milliseconds) while waiting for the etcd service to connect."
    • Service Control Manager: "The etcd service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion."
  • C:\etcd\data contains the following files:

    FullName, Length
    C:\etcd\data\member, 1
    C:\etcd\data\member\snap, 1
    C:\etcd\data\member\wal, 1
    C:\etcd\data\member\snap\db, 32768
    C:\etcd\data\member\snap\db.lock, 0
    C:\etcd\data\member\wal\0.tmp, 64000000
    C:\etcd\data\member\wal\0000000000000000-0000000000000000.wal, 64000000

Workaround:

  • None

Additional information:

Running etcd.exe from the command prompt works fine. However, etcd service won't even run as "LocalSystem" (that's the "Do whatever you want" built-in account).
I was able to reproduce the issue on multiple Win10 machines.
I assume that it has something to do with the working directory (that's at least the most likely cause from my experience if an application can be started from cmd.exe but not as a service). The default working directory for a Windows service is C:\Windows\system32 (which is locked down for good reasons).

Environment:

  • Windows 10.0.16299 Build 16299 x64
  • etcd.exe --version
    etcd Version: 3.3.11
    Git SHA: 2cf9e51
    Go Version: go1.10.7
    Go OS/Arch: windows/amd64
@hexfusion
Contributor

hexfusion commented Jan 26, 2019

Hi @jasper-d, it looks like this is a known issue. Maybe you can help us fix it? I have neither access to Windows machines for testing nor expertise with them, so your help would be greatly appreciated.

ref:
#3351
#3410

@jasper-d
Author

@hexfusion I'll look into it, but it may take a few days because I have little to no experience with Go.

@hexfusion
Contributor

hexfusion commented Jan 28, 2019

@jasper-d we can assist with the Go if you can assist with Windows testing. Take a look at the existing PR above and see if it gives you any hints. Basically, can you review the research that was already done and find out what the proper method is for managing a Windows service with Go? Maybe it is still the same, in which case we can reuse that PR as a starting point.

1.) Review existing PRs and issues.
2.) Research current best practices for Windows services in Go.

From there we have a good place to start; this will move things forward without code. Thanks!

@jasper-d
Author

@hexfusion I don't mind trying out some things and learning some Go in the process. I got the PR working with some minor changes and will take a look at some Go services that run on Windows (e.g. gnatsd, Elastic Filebeat) to see how they do it.

@hexfusion
Contributor

Hi @jasper-d, just checking in. Do you have any questions?

@jasper-d
Author

jasper-d commented Feb 7, 2019

@hexfusion Not yet, I was occupied with some more pressing issues. I probably won't have time to look into it before next weekend.

@tskarman

tskarman commented Feb 7, 2019

Just wanted to let you know that this is not a general issue.
I am running etcd and the etcd grpc proxy as a Windows service across a wide variety of Windows machines (Windows Server 2012 R2, Windows Server 2016, Windows Server 2019, Windows 10) and have been for >6 months and across various etcd versions.

Of note:

  • I am also running them under the Local System account. I am not sure whether I tried running them with virtual service accounts
  • I am managing the installation and execution via nssm and not sc. I never tried installing them with sc so not sure whether that makes a difference.

I am using the pre-release version 2.2.4-101 linked on this page: https://nssm.cc/download
Not sure whether the normal version would work.

With nssm I am specifying the etcd directory as the startup directory. Since you mentioned working directories, that might make a difference.

I am not doing anything too special parameter-wise. I am specifying various bindings explicitly, though. And also I am not binding anything to localhost, 127.0.0.1 or 0.0.0.0. Not that that should make a difference, though. The service usually fails to start very promptly if a port/binding is in use.

Example:

etcd --name etcd3 --client-cert-auth=true --listen-client-urls https://1.2.3.4:2379 --advertise-client-urls https://etcd3.example.com:2379 --listen-peer-urls https://1.2.3.4:2380 --initial-advertise-peer-urls https://etcd3.example.com:2380 --initial-cluster-token etcd-cluster-1 --discovery-srv example.com --initial-cluster-state existing --peer-cert-file C:\somepath\member3.pem --peer-key-file C:\somepath\member3-key.pem --peer-trusted-ca-file C:\somepath\ca.pem --cert-file C:\somepath\member3.pem --key-file C:\somepath\member3-key.pem --trusted-ca-file C:\somepath\ca.pem

@haroldHT
Contributor

haroldHT commented Feb 8, 2019

@jasper-d if you want etcd to work on Windows, you must make it a Windows service.
#3410 achieves that on Windows but does not work on Linux.
I want to improve it.

@haroldHT
Contributor

haroldHT commented Feb 8, 2019

@tskarman
I can reproduce jasper-d's issue when I use sc on Windows 10.

@hexfusion
Can I create a new PR?
It would work on both Windows and Linux.

@jasper-d
Author

jasper-d commented Feb 8, 2019

@haroldHT #3410 does not stop etcd gracefully and has some other flaws. The reason etcd does not work as a service is that it doesn't communicate with the SCM (Service Control Manager). #3410 adds some basic support for it (using Go's svc package, which essentially all Windows services written in Go use). Properly handling stop/shutdown as well as redirecting stdout/stderr (i.e. enabling log output) requires some more work. You're welcome to contribute of course. 🙂
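
The exchange a service is expected to have with the SCM can be pictured as a small state machine: report "start pending", report "running", answer status queries, and report "stopped" after a stop request. The sketch below is a portable simulation only. All types and names in it are hypothetical stand-ins defined locally; it does not use or reproduce the API of golang.org/x/sys/windows/svc, which a real implementation on Windows would call instead.

```go
package main

import "fmt"

// Local stand-ins for service states and control commands. These are
// illustrative only and are NOT the golang.org/x/sys/windows/svc types.
type state string

const (
	startPending state = "START_PENDING"
	running      state = "RUNNING"
	stopPending  state = "STOP_PENDING"
	stopped      state = "STOPPED"
)

type command int

const (
	interrogate command = iota
	stop
)

// runService reports every state change on status. Reporting is the step a
// plain console program never performs, which is why the start times out.
func runService(requests <-chan command, status chan<- state, start, shutdown func()) {
	status <- startPending
	start()
	status <- running
	for cmd := range requests {
		if cmd == stop {
			break
		}
		status <- running // interrogate: repeat the current state
	}
	status <- stopPending
	shutdown() // stop the server before reporting the final state
	status <- stopped
}

// simulate plays the role of the service manager: it queues the given
// commands and records every state the service reports.
func simulate(cmds []command) []state {
	requests := make(chan command, len(cmds))
	for _, c := range cmds {
		requests <- c
	}
	close(requests)

	status := make(chan state)
	go func() {
		defer close(status)
		runService(requests, status, func() {}, func() {})
	}()

	var seen []state
	for s := range status {
		seen = append(seen, s)
	}
	return seen
}

func main() {
	fmt.Println(simulate([]command{interrogate, stop}))
}
```

In this sketch the empty `start` and `shutdown` callbacks mark the places where the actual server would be started and stopped.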

@tskarman NSSM does a lot of stuff (i.e. stdout/stderr redirection). I reckon it does communicate with SCM as well which would explain why you can start etcd as a service when using it. However, relying on hackish 3rd party tools is a workaround, not a solution from my point of view.

@tskarman

tskarman commented Feb 8, 2019

@jasper-d yes, I completely understand and am now interested in a solution as well. Let me know if I can help. My Go is rusty and not a priority for me right now, but I could help with testing across the aforementioned operating systems.

That being said, I run etcd like this in production and have not encountered any reliability, responsiveness, or service signalling issues. So I would recommend this as a workaround for the time being.

@haroldHT
Contributor

haroldHT commented Feb 9, 2019

@jasper-d #10460

But I am confused about the log output.
For the location of the log files (i.e. etcd_err.log, etcd_out.log) I can use cfg.ec.Dir.
For writing the logs, does etcd have some utilities that I can use?

I also do not know how to connect etcd's logs to the service.
Thanks.

@hexfusion
Contributor

> @hexfusion
> Can I create a new PR ?
> Both work in windows and linux.

@haroldHT thanks for showing interest in resolving this. Please work with @jasper-d and @tskarman on a solution then let me know if you have any questions.

@haroldHT
Contributor

@hexfusion Sorry, I always @ the wrong people.
etcd has so many kinds of logs that it confuses me; I need to spend a lot of time to understand them.

@jingyih
Contributor

jingyih commented Mar 4, 2019

cc @wenjiaswe

@wenjiaswe
Contributor

Thank you all for helping out! @haroldHT also contacted me offline and showed interest in contributing on this as his first etcd contribution. I will assign @haroldHT for now, @jasper-d and @tskarman any help is welcome!

@wenjiaswe wenjiaswe self-assigned this Mar 4, 2019
@wenjiaswe
Contributor

/assign @haroldHT

@wenjiaswe wenjiaswe removed their assignment Mar 4, 2019
@wenjiaswe
Contributor

Well, it seems I cannot assign you right now, @haroldHT. This is a good place to start your contribution. Thanks!

@jasper-d
Author

jasper-d commented Mar 5, 2019

@haroldHT I managed to botch up the code base enough to make etcd run as a Windows service. It properly interacts with the Windows Service Control Manager through x/sys/windows/svc. I haven't fully tested it yet, but as far as I can tell it works for ordinary cluster members, layer 4 gateways and gRPC proxies. Log output is redirected to the Windows Event Log or a file (logs are confusing indeed, I ended up redirecting every log that didn't hide well enough 🙃).
I need to clean up some stuff before I can push it to a public repo, but I will do so tomorrow so you can take a look.

cc: @hexfusion @tskarman

@hexfusion
Contributor

hexfusion commented Mar 6, 2019 via email

@jasper-d
Author

jasper-d commented Mar 6, 2019

Changes (with some comments) are here: jasper-d@9d92352

The good:

  • Runs as a Windows service, notifies SCM (Service Control Manager) when started (i.e. isn't timed out by SCM anymore) and shuts down gracefully when receiving a stop signal from SCM
  • Works for cluster members as well as gRPC and layer 4 proxy
  • Logs are (partially) redirected (stdout/stderr doesn't work for Windows services), either to Windows event log or a specified file (I'm using lumberjack here to get easy log rotation because there is no logrotate on Windows).

The bad:

  • 0 coverage 😱
  • There are several logs that aren't redirected yet. The ones I know of are the gRPC log, TCP log and raft log. As @haroldHT mentioned, the number of logs is rather confusing.
  • SCM reports an error when stopping the service (A system error has occurred. System error 1067 has occurred. The process terminated unexpectedly.). I need to investigate what's happening there.
  • Linux/tests are probably broken, haven't checked yet
  • Running a multi-node cluster doesn't work right now. I have yet to verify if that's an issue with my cluster config or a bug introduced by my changes.
  • There is no shutdown logic yet for proxies/gateways. I assume that sockets/grpc client/server must be properly closed when terminating

So, there is still a lot of work to do. Before continuing I would need to add at least some tests and set up a proper testing environment. The main problems remain the wealth of logs (I would certainly need some advice here) and the different ways in which etcd is started. I think that should be unified ideally, but that would probably be quite a refactoring (i.e. should be done by someone with a better understanding of the code base and go).

@haroldHT How is it going for you?
@hexfusion I wouldn't mind investing some more time but you may wanna take a look at it first to determine if it's worth the effort. I also cannot make any definitive commitments to a timeline because it's essentially a pet project for learning some go in my spare time.

@haroldHT
Contributor

haroldHT commented Mar 8, 2019

@wenjiaswe sorry, I did not reply in time.
I will continue to follow your suggestion.

@haroldHT
Contributor

haroldHT commented Mar 8, 2019

@jasper-d At the beginning I wanted to use kardianos/service to make etcd a service.
It seems like a good solution if we can manage the number of logs.
Your solution also helped me a lot, thanks.

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020