Discovery failing on EC2 #576

Closed
drnic opened this issue Feb 16, 2014 · 45 comments

Comments

@drnic
Contributor

drnic commented Feb 16, 2014

Does etcd & its -discovery mode work on EC2; or is it just failing for me?

$ etcd -discovery https://discovery.etcd.io/madeup
[etcd] Feb 16 21:19:57.292 WARNING   | Using the directory 51e7ef11-e39b-41a9-a8eb-f8ed387ac5cc.etcd as the etcd curation directory because a directory was not specified. 
[etcd] Feb 16 21:19:57.292 INFO      | Discovery via https://discovery.etcd.io using prefix /madeup.
[etcd] Feb 16 21:19:57.703 CRITICAL  | Discovery failed and a backup peer list wasn't provided: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

Security group for the EC2 VM it's running in:

[image: ec2_management_console] https://f.cloud.github.com/assets/108/2181480/9849a4d2-9752-11e3-9d75-e328359ad5df.png

@polvi
Contributor

polvi commented Feb 16, 2014

You have to use a valid token, which you get by hitting discovery.etcd.io/new

@drnic
Contributor Author

drnic commented Feb 16, 2014

That also didn't work. Sorry, I can show that output too.

@drnic
Contributor Author

drnic commented Feb 16, 2014

$ curl https://discovery.etcd.io/new
https://discovery.etcd.io/02c009dbb1888fc8b710c650fbf55642
$ etcd -discovery https://discovery.etcd.io/02c009dbb1888fc8b710c650fbf55642
[etcd] Feb 16 21:50:50.595 WARNING   | Using the directory 51e7ef11-e39b-41a9-a8eb-f8ed387ac5cc.etcd as the etcd curation directory because a directory was not specified. 
[etcd] Feb 16 21:50:50.595 INFO      | Discovery via https://discovery.etcd.io using prefix /02c009dbb1888fc8b710c650fbf55642.
[etcd] Feb 16 21:50:51.006 CRITICAL  | Discovery failed and a backup peer list wasn't provided: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@drnic
Contributor Author

drnic commented Feb 16, 2014

Though I thought I read you can make up your own tokens.

@drnic
Contributor Author

drnic commented Feb 16, 2014

etcd must work on AWS, at least in a VPC, as the Cloud Foundry group are using it in production for http://run.pivotal.io, so I'm not 100% sure what I'm doing wrong. Though they aren't using -discovery, and I think they're on v0.2.0.

@polvi
Contributor

polvi commented Feb 16, 2014

Grab a new token, then use that URL for all the -discovery args. Add -vv to
etcd as well and gist the log output. Could you also confirm that you can
manually connect to the remote etcd server port with telnet or something?
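
For anyone without telnet installed, the port check suggested above can be sketched in plain shell. This is a minimal sketch, assuming bash and the GNU `timeout` utility are available; the function names `peer_host_port`, `port_open` and `check_peer` are hypothetical helpers, not part of etcd, and IPv6 addresses are not handled.

```shell
# Split a peer URL such as http://10.0.0.184:7001 into "host port".
peer_host_port() {
  local url="${1#*://}"   # drop the scheme, if any
  url="${url%%/*}"        # drop any trailing path
  printf '%s %s\n' "${url%:*}" "${url##*:}"
}

# Return 0 if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so neither telnet nor nc is needed.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Report whether the peer port behind a URL accepts connections.
check_peer() {
  set -- $(peer_host_port "$1")   # now $1 = host, $2 = port
  if port_open "$1" "$2"; then
    echo "reachable: $1:$2"
  else
    echo "unreachable: $1:$2"
    return 1
  fi
}
```

Running something like `check_peer http://10.0.0.184:7001` from each machine against every other machine's peer port separates security group or routing problems from problems inside etcd itself.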

@drnic
Contributor Author

drnic commented Feb 16, 2014

-vv isn't generating any additional output; odd.

$ etcd -discovery https://discovery.etcd.io/896ac583ef83bebad23a4ba51277de11 -vv
[etcd] Feb 16 21:55:09.813 WARNING   | Using the directory 51e7ef11-e39b-41a9-a8eb-f8ed387ac5cc.etcd as the etcd curation directory because a directory was not specified. 
[etcd] Feb 16 21:55:09.814 INFO      | Discovery via https://discovery.etcd.io using prefix /896ac583ef83bebad23a4ba51277de11.
[etcd] Feb 16 21:55:10.225 CRITICAL  | Discovery failed and a backup peer list wasn't provided: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@drnic
Contributor Author

drnic commented Feb 16, 2014

@polvi thx for trying to help, btw

@polvi
Contributor

polvi commented Feb 16, 2014

Could you show the output from all the machines? Please include full cmd
line args from each host.

@drnic
Contributor Author

drnic commented Feb 16, 2014

This was a standalone example with no other nodes. It doesn't pause and wait; it just fails.

When I ran three nodes, etcd startup failed so quickly that the nodes never appeared to try to communicate with each other. So I reproduced the error with a single node.

@drnic
Contributor Author

drnic commented Feb 16, 2014

I can't get etcd on my Ubuntu 10.04 EC2 VMs to do anything other than fail immediately when I use the -discovery option.

@philips
Contributor

philips commented Feb 17, 2014

@drnic Even with discovery you need to provide the -addr and -peer-addr arguments so that discovery uploads the right IP addresses.
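
As an illustration of that advice, here is a minimal sketch of a helper that assembles the command line for a node in a multi-machine cluster. It uses only flags that appear elsewhere in this thread (-name, -addr, -peer-addr, -discovery) and the default ports 4001 and 7001; the function name `etcd_discovery_cmd` and the loopback check are assumptions added for illustration. Loopback addresses are fine when every node runs on the same host, which is why the check only makes sense for separate machines.

```shell
# Print an etcd v0.3-style command line that advertises a routable IP.
# Arguments: node name, an IP the other machines can reach, discovery URL.
etcd_discovery_cmd() {
  local name="$1" ip="$2" url="$3"
  case "$ip" in
    127.*|localhost)
      # Other machines cannot reach a loopback address.
      echo "refusing to advertise loopback address $ip" >&2
      return 1 ;;
  esac
  printf 'etcd -name %s -addr %s:4001 -peer-addr %s:7001 -discovery %s\n' \
    "$name" "$ip" "$ip" "$url"
}
```

The function only prints the command, so the output can be reviewed before it is run on each host.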

@drnic
Contributor Author

drnic commented Feb 17, 2014

$ etcd -peer-addr 127.0.0.1:7001 -addr 127.0.0.1:4001 -discovery https://discovery.etcd.io/2d0cf5e9cf33d601caf861bb304117a1
[etcd] Feb 17 05:47:23.743 WARNING   | Using the directory 51e7ef11-e39b-41a9-a8eb-f8ed387ac5cc.etcd as the etcd curation directory because a directory was not specified. 
[etcd] Feb 17 05:47:23.744 INFO      | Discovery via https://discovery.etcd.io using prefix /2d0cf5e9cf33d601caf861bb304117a1.
[etcd] Feb 17 05:47:24.156 CRITICAL  | Discovery failed and a backup peer list wasn't provided: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@drnic
Contributor Author

drnic commented Feb 17, 2014

@philips sorry, I thought these had default values if not specified

@drnic
Contributor Author

drnic commented Feb 17, 2014

Still get same issue on AWS EC2.

@drnic
Contributor Author

drnic commented Feb 17, 2014

Happy to try other debugging ideas. Sorry it's not working for me :/

@drnic
Contributor Author

drnic commented Feb 17, 2014

Unlikely to be relevant but...

$ uname -a
Linux 51e7ef11-e39b-41a9-a8eb-f8ed387ac5cc 3.0.0-32-virtual #51~lucid1-Ubuntu SMP Fri Mar 22 18:13:07 UTC 2013 x86_64 GNU/Linux

@polvi
Contributor

polvi commented Feb 17, 2014

Try with -addr, -peer-addr, and a fresh -discovery url

@drnic
Contributor Author

drnic commented Feb 17, 2014

With fresh token & explicit name

$ etcd -peer-addr 127.0.0.1:7001 -addr 127.0.0.1:4001 -discovery https://discovery.etcd.io/2d3468dabe9d36a54b50c5e05ceec623 -name aaa
[etcd] Feb 17 06:40:45.138 WARNING   | Using the directory aaa.etcd as the etcd curation directory because a directory was not specified. 
[etcd] Feb 17 06:40:45.138 INFO      | Discovery via https://discovery.etcd.io using prefix /2d3468dabe9d36a54b50c5e05ceec623.
[etcd] Feb 17 06:40:45.550 CRITICAL  | Discovery failed and a backup peer list wasn't provided: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@yichengq
Contributor

I have tried to build it exactly as you describe and it works well.

I guess it may be a version problem.
Which version of etcd did you use?
I checked it out from https://github.com/coreos/etcd.git and built it manually.

@drnic
Contributor Author

drnic commented Feb 17, 2014

I was using the v0.3.0 release. I can try to build from HEAD if you think there has been a fix since 0.3.0?

Out of curiosity, could you download v0.3.0 and see if it works?

@philips
Contributor

philips commented Feb 17, 2014

@drnic I tried a two-node cluster on an EC2 machine using the release tarball just now and it works fine:

$ cd etcd-v0.3.0-linux-amd64
$ ./etcd -discovery https://discovery.etcd.io/07a71c7632415ffde6a3cf14533e88f3
[etcd] Feb 17 18:43:40.755 WARNING   | Using the directory ip-10-244-134-157.etcd as the etcd curation directory because a directory was not specified.
[etcd] Feb 17 18:43:40.756 INFO      | Discovery via https://discovery.etcd.io using prefix /07a71c7632415ffde6a3cf14533e88f3.
[etcd] Feb 17 18:43:41.212 INFO      | Discovery _state was empty, so this machine is the initial leader.
[etcd] Feb 17 18:43:41.213 INFO      | ip-10-244-134-157: state changed from 'stopped' to 'follower'.
[etcd] Feb 17 18:43:41.213 INFO      | ip-10-244-134-157: state changed from 'follower' to 'leader'.
[etcd] Feb 17 18:43:41.213 INFO      | ip-10-244-134-157: leader changed from '' to 'ip-10-244-134-157'.
[etcd] Feb 17 18:43:41.260 INFO      | etcd server [name ip-10-244-134-157, listen on [::]:4001, advertised url http://127.0.0.1:4001]
[etcd] Feb 17 18:43:41.261 INFO      | peer server [name ip-10-244-134-157, listen on [::]:7001, advertised url http://127.0.0.1:7001]
$ ./etcd -discovery 'https://discovery.etcd.io/07a71c7632415ffde6a3cf14533e88f3' -peer-addr 127.0.0.1:7002 -addr 127.0.0.1:4002 -name 2 -data-dir 2.etcd
[etcd] Feb 17 18:46:23.235 INFO      | Discovery via https://discovery.etcd.io using prefix /07a71c7632415ffde6a3cf14533e88f3.
[etcd] Feb 17 18:46:23.782 INFO      | Discovery found peers [http://127.0.0.1:7001]
[etcd] Feb 17 18:46:23.783 INFO      | 2: state changed from 'stopped' to 'follower'.
[etcd] Feb 17 18:46:23.824 INFO      | etcd server [name 2, listen on [::]:4002, advertised url http://127.0.0.1:4002]
[etcd] Feb 17 18:46:23.825 INFO      | peer server [name 2, listen on [::]:7002, advertised url http://127.0.0.1:7002]
[etcd] Feb 17 18:46:23.836 INFO      | 2: leader changed from '' to 'ip-10-244-134-157'.
[etcd] Feb 17 18:46:23.888 INFO      | 2: peer added: 'ip-10-244-134-157'

@yichengq
Contributor

@drnic Could you build HEAD and try it again?
I have made etcd print more debug info for the discovery process.

@lnguyen
Contributor

lnguyen commented Feb 18, 2014

OK, this is very odd... we have monit monitoring etcd, and it keeps trying to bring it up. It seems that eventually (no idea why) discovery does work. See the gist for logs: https://gist.github.com/longnguyen11288/077267961ac2287cd210

@yichengq
Contributor

@drnic @longnguyen11288
https://github.com/unihorn/etcd/tree/8
Could you build this one and try it again?
It will print more debug info for the discovery process to the file testlogfile.

@zeisss

zeisss commented Feb 20, 2014

I have the same problem with my CoreOS VM on VirtualBox with etcd v0.3.0.

$ sudo /usr/bin/etcd -vv -bind-addr 192.168.53.4:4001 -peer-addr 192.168.53.4:7001 -discovery=https://discovery.etcd.io/7cdbad7b454f233575c4315788490f06 -data-dir /var/lib/etcd -name dockzero-03 -f
[etcd] Feb 20 09:41:46.218 INFO      | Discovery via https://discovery.etcd.io using prefix /7cdbad7b454f233575c4315788490f06.
[etcd] Feb 20 09:41:48.625 CRITICAL  | Discovery failed and a backup peer list wasn't provided: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

Removing the -discovery argument brings up etcd correctly. I also removed the data-dir, in case there was some invalid state left over, but that didn't help either.

@unihorn I also tried your branch - no testlogfile was written.

EDIT: Using coreos/master (46d817f), I got a bit more output:

$ sudo ./bin/etcd -vv -bind-addr=192.168.53.4:4001 -peer-addr=192.168.53.4:7001 -discovery=https://discovery.etcd.io/7cdbad7b454f233575c4315788490f06
[etcd] Feb 20 10:16:55.192 WARNING   | Using the directory dockzero-03.etcd as the etcd curation directory because a directory was not specified. 
[etcd] Feb 20 10:16:55.192 DEBUG     | open dockzero-03.etcd/snapshot: no such file or directory
[raft]10:16:55.193038 log.open.open  dockzero-03.etcd/log
[raft]10:16:55.193659 log.open.create  dockzero-03.etcd/log
[etcd] Feb 20 10:16:55.194 INFO      | dockzero-03: state changed from 'stopped' to 'follower'.
[raft]10:16:55.195338 Name: dockzero-03, State: follower, Term: 0, CommitedIndex: 0 
[etcd] Feb 20 10:16:55.195 INFO      | Discovery via https://discovery.etcd.io using prefix /7cdbad7b454f233575c4315788490f06.
[etcd] Feb 20 10:16:57.603 WARNING   | Discovery encountered an error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
[etcd] Feb 20 10:16:57.603 INFO      | URLs:  / dockzero-03 ()
[etcd] Feb 20 10:16:57.603 CRITICAL  | Discovery failed, no available peers in backup list, and no log data

@bfosberry

I have the same issue, same version, inside coreos. Using a discovery url or token fails for any node being spun up. Having a standalone node, and using an explicit peer nodes list for other nodes works, however that kind of defeats the purpose. :P

core@core-02 ~ $ etcd --version
v0.3.0
core@core-02 ~ $ uname -a
Linux core-02 3.13.2+ #2 SMP Mon Feb 17 22:49:34 UTC 2014 x86_64 Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz GenuineIntel GNU/Linux

@yichengq
Contributor

yichengq commented Mar 6, 2014

@zeisss @bfosberry
Could you try this branch?
https://github.com/unihorn/etcd/tree/28

It prints out the debug information from go-etcd, which will help me a lot in finding the reasons for the error.
Sorry for the late response.

@newhoggy

I also have the same problem inside coreos:

core@ip-10-0-0-184 ~ $ etcd --version
v0.3.0
core@ip-10-0-0-184 ~ $ uname -a
Linux ip-10-0-0-184 3.13.5+ #2 SMP Wed Mar 5 08:34:30 UTC 2014 x86_64 Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz GenuineIntel GNU/Linux

Logs:

Mar 10 10:59:13 ip-10-0-0-183 systemd[1]: Stopping etcd...
Mar 10 10:59:14 ip-10-0-0-183 systemd[1]: Starting etcd...
Mar 10 10:59:14 ip-10-0-0-183 systemd[1]: Started etcd.
Mar 10 10:59:14 ip-10-0-0-183 etcd-bootstrap[20016]: [etcd] Mar 10 10:59:14.537 INFO      | Discovery via https://discovery.etcd.io using prefix /e99066c9e19b4472b002826217bd3f28.
Mar 10 10:59:15 ip-10-0-0-183 etcd-bootstrap[20016]: [etcd] Mar 10 10:59:15.748 INFO      | Discovery found peers [http://10.0.0.184:7001 http://10.0.0.182:7001]
Mar 10 10:59:15 ip-10-0-0-183 etcd-bootstrap[20016]: [etcd] Mar 10 10:59:15.749 INFO      | 10.0.0.183: state changed from 'stopped' to 'follower'.
Mar 10 10:59:17 ip-10-0-0-183 etcd-bootstrap[20016]: [etcd] Mar 10 10:59:17.101 WARNING   | Attempt to join via 10.0.0.184:7001 failed: Error during join version check: Get http://10.0.0.184:7001/version: net/http: timeout awaiting response heade
Mar 10 10:59:18 ip-10-0-0-183 etcd-bootstrap[20016]: [etcd] Mar 10 10:59:18.452 WARNING   | Attempt to join via 10.0.0.182:7001 failed: Error during join version check: Get http://10.0.0.182:7001/version: net/http: timeout awaiting response heade
Mar 10 10:59:18 ip-10-0-0-183 etcd-bootstrap[20016]: [etcd] Mar 10 10:59:18.453 WARNING   | Unable to join the cluster using any of the peers [10.0.0.184:7001 10.0.0.182:7001]. Retrying in 10.0 seconds

@zeisss

zeisss commented Mar 10, 2014

@unihorn: Strange things happened. My first try failed again, but later in the process (see https://gist.github.com/ZeissS/9462727).
Afterwards my colleague told me he had upgraded Vagrant and his problem disappeared, so I did the same (Vagrant 1.3.5 to 1.4.3).
CoreOS itself (CoreOS 247 w/ etcd v0.3.0) does not seem to work, but using your version (unihorn/28) worked just fine. So I guess a Vagrant update and using master fixes this for me.

Is there a schedule for the next etcd release, which goes into coreos?

@philips
Contributor

philips commented Mar 10, 2014

@zeisss There isn't a schedule for the next release. I would like to make a release after this bug is closed. But, we need to track it down first.

@newhoggy

Have you managed to reproduce it on EC2?

I'm currently using a single host cluster to work around the problem.

@yichen

yichen commented Mar 12, 2014

Hey, looks like I have the exact same problem. I have three EC2 instances. Two of them can discover each other, the third one will fail with the "Unable to join the cluster using any of the peers" error shown above.

After the third instance fails to join, the other two instances show the etcd warning: heartbeat time out: 'instance3'.

@philips
Contributor

philips commented Mar 12, 2014

@yichen That sounds like discovery is actually working but there is a network partition for some reason or the etcd service isn't running on instance3. Can you try restarting instance3 and see if it comes back?

@yichen

yichen commented Mar 13, 2014

Hey guys, thanks for the quick response. This problem is resolved. Sorry, I am not entirely sure what the root cause was, but here is the list of things I did:

  • Noticed that on one of the instances I was using the local address 127.0.0.1, which resulted in this address showing up under the discovery URL. It might eventually have expired, but I created a new discovery "prefix" to start with a clean slate.
  • Deleted all working directories and restarted all the etcd instances, making sure the first one started without the --discovery parameter and the other two with the --discovery parameter.
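
A stale loopback registration like the one in the first bullet can be spotted in the logs of a joining node. The following is a minimal sketch that scans for the "Discovery found peers [...]" message shown earlier in this thread and prints any peer URL pointing at 127.x or localhost; the function name `loopback_peers` is hypothetical, and the log format is assumed to match the v0.3.0 output quoted above.

```shell
# Read etcd log lines on stdin and print any loopback peer URLs found in
# "Discovery found peers [...]" messages.
# Exit status is 0 if at least one loopback peer was printed, 1 otherwise.
loopback_peers() {
  sed -n 's/.*Discovery found peers \[\(.*\)].*/\1/p' |
    tr ' ' '\n' |
    grep -E '^https?://(127\.|localhost)'
}
```

For example, piping a node's captured log through `loopback_peers` prints nothing when every advertised peer is routable, in which case a fresh discovery token is probably not needed for this particular reason.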

@philips
Contributor

philips commented Mar 13, 2014

Last night I had a strange network partition on EC2 which reproduced in this manner. I had three machines: L was the leader, and there were two followers, A and B.

A was unable to join the cluster or curl any of L's endpoints. B was participating just fine.

Then I set up ncat on L listening on another port and ran echo foobar | ncat L 7002. It hung for around 3 seconds and then foobar showed up on L.

@yichengq
Contributor

@philips It could be a problem with the connection timeout settings and the boot order of the raft and peer servers. I think @xiangli-cmu is fixing it now: #626

@drnic
Contributor Author

drnic commented Mar 14, 2014

I wasn't seeing partial failure as you've been seeing above. Sorry I haven't made time to try out the debug version. :(

@bfosberry

@unihorn that branch (28) worked for me. Based on what @viliamjr said I tried adding in

  config.vm.provider :virtualbox do |vb, override|
    vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
    vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
  end

to my vagrant config for vbox in case it was a routing error, but no good. Also I can manually hit the url outside of etcd.

Your branch worked great though, any ideas a) when that will get released and b) when it will get incorporated into coreos?

@yichengq
Contributor

@bfosberry It is really weird that it only has problems sometimes.
We have refactored the listen code in #626, and hope that helps.
There are also other issues open for it.
I would expect it to be eliminated, or at least explained, after these PRs.
Please use the -vvv flag when reporting after #653 is merged.

@bfosberry

Latest vagrant coreos image works!

from https://github.com/coreos/coreos-vagrant/blob/master/Vagrantfile

@yichengq
Contributor

@bfosberry Great! :) :)

@xiang90 xiang90 closed this as completed Aug 23, 2014
@bataras

bataras commented Sep 5, 2014

I'm getting discovery failure and also SLOW discovery. cluster w/5 machines. 2 in us-west and 3 in us-east. 2 VPCs with VPN between them. The networking is -solid-. I can ssh between all nodes and ping consistently at 1ms within a region and 60ms between regions. 4 of the 5 nodes have clustered. But the 5th node in us-east keeps giving this error...

$etcdctl ls
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

Of note.. I created 2 nodes in us-west first. They clustered immediately. Then added a node in us-east. It took a long time to join the cluster. like 5-10 minutes, but it finally did. And again, the networking between all nodes and to the internet is fine.

And I started with a new discovery token

@yichengq
Contributor

yichengq commented Sep 8, 2014

@bataras We don't fully support multi-datacenter etcd yet; see #964.
The workaround for now is to set a higher heartbeat interval and election timeout.
You could try setting the heartbeat interval to 150ms and the election timeout to 1s.
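
A minimal sketch of how those two values might be passed on the command line follows. The flag names `-peer-heartbeat-interval` and `-peer-election-timeout` (both in milliseconds) are an assumption based on the 0.4.x series and are not confirmed anywhere in this thread; earlier releases may spell them differently, so check your version's help output first. The helper name `etcd_timing_flags` is hypothetical.

```shell
# Print peer-timing flags for etcd; both arguments are in milliseconds.
# ASSUMPTION: the flag names are the 0.4.x spellings and may differ in
# other etcd versions.
etcd_timing_flags() {
  local heartbeat_ms="$1" election_ms="$2"
  # An election timeout at or below the heartbeat interval would trigger
  # elections before a single heartbeat could arrive.
  if [ "$election_ms" -le "$heartbeat_ms" ]; then
    echo "election timeout must exceed the heartbeat interval" >&2
    return 1
  fi
  printf '%s\n' "-peer-heartbeat-interval=$heartbeat_ms -peer-election-timeout=$election_ms"
}
```

With the values suggested above, `etcd_timing_flags 150 1000` prints the flags to append to each node's etcd command line; every node in the cluster should use the same values.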
