Discovery failing on EC2 #576
You have to use a valid token by hitting discovery.etcd.io/new
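A minimal sketch of fetching such a token, assuming `curl` is installed and the public discovery service is reachable. The service is expected to reply with a full URL containing the new token. The helper name `fetch_discovery_url` is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Request a fresh discovery URL from the public service (needs network access).
# fetch_discovery_url is a hypothetical helper name used only for this sketch.
fetch_discovery_url() {
    # -f: fail on HTTP errors, -sS: stay quiet but still print errors
    curl -fsS https://discovery.etcd.io/new
}

# Usage (run once, then reuse the printed URL on every node):
#   DISCOVERY_URL=$(fetch_discovery_url)
```

The same URL has to be passed to every node that should join the cluster, so fetch it once rather than once per machine.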
|
That also didn't work. Sorry, I can show that output too.
|
|
Though I thought I read you can make up your own tokens. |
etcd must work on AWS, at least in a VPC, as the Cloud Foundry group are using it in production for http://run.pivotal.io; so I'm not 100% sure what I'm doing wrong. Though they aren't using -discovery, and I think they are on v0.2.0. |
Grab a new token, then use that URL for all the -discovery args. Add -vv to
|
-vv isn't generating any additional output; odd.
|
@polvi thx for trying to help, btw |
Could you show the output from all the machines? Please include the full command.
|
This was a standalone example; no other nodes. It doesn't pause and wait; it just fails. When I ran three nodes, etcd startup failed so quickly that there was no sign they were even trying to communicate with each other. So I tried to reproduce the error with a single node.
|
I can't get etcd on my Ubuntu 10.04 EC2 VMs to do anything other than fail immediately when I use the -discovery option.
|
@drnic Even with discovery you need to provide the -addr and -peer-addr arguments so that discovery uploads the right IP addresses. |
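A rough sketch of what such a command line could look like on one node, assuming the etcd 0.x default ports (4001 for clients, 7001 for peers). The node name, IP address, and token placeholder are hypothetical values, not taken from this thread. The function only prints the command so it can be checked before running it.

```shell
#!/bin/sh
# Print (not run) an etcd command line with explicit advertised addresses.
# The name, IP, and discovery URL passed in below are hypothetical placeholders.
etcd_cmd() {
    name="$1"; ip="$2"; discovery_url="$3"
    # 4001 = client port, 7001 = peer port (etcd 0.x defaults)
    printf 'etcd -name %s -addr %s:4001 -peer-addr %s:7001 -discovery %s\n' \
        "$name" "$ip" "$ip" "$discovery_url"
}

etcd_cmd node1 10.0.0.11 'https://discovery.etcd.io/<token>'
```

Each node would use its own name and an IP address the other nodes can reach, while the -discovery URL stays identical across the cluster.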
|
@philips sorry, I thought these had default values if not specified |
Still get same issue on AWS EC2. |
Happy to try other debugging ideas. Sorry it's not working for me :/ |
Unlikely to be relevant but...
|
Try with -addr, -peer-addr, and a fresh -discovery URL.
|
With fresh token & explicit name
|
I built it exactly as you describe and it works fine. I guess it may be a version problem. |
I was using the v0.3.0 release. I can try building from HEAD if you think there has been a fix since 0.3.0. Out of curiosity, could you download v0.3.0 and see if it works?
|
@drnic I tried a two node cluster on an ec2 machine using the release tarball just now and it works fine:
|
@drnic Could you build from HEAD and try it again? |
OK, this is very odd... we have monit monitoring etcd, and it keeps trying to bring it up. It seems that eventually (no idea why) discovery does work. Refer to the gist for logs: https://gist.github.com/longnguyen11288/077267961ac2287cd210 |
@drnic @longnguyen11288 |
I have the same problem with my coreos VM on VirtualBox with etcd
Removing the @unihorn I also tried your branch - no EDIT: Using coreos/master (46d817f), I got a bit more output:
|
I have the same issue, same version, inside coreos. Using a discovery url or token fails for any node being spun up. Having a standalone node, and using an explicit peer nodes list for other nodes works, however that kind of defeats the purpose. :P
|
@zeisss @bfosberry It prints out the debug information from go-etcd, which will help me a lot in finding the cause of the error. |
I also have the same problem inside coreos:
Logs:
|
@unihorn: Strange things happened. My first try failed again, but later in the process (see https://gist.github.com/ZeissS/9462727). Is there a schedule for the next etcd release that will go into CoreOS? |
@zeisss There isn't a schedule for the next release. I would like to make a release after this bug is closed. But, we need to track it down first. |
Have you managed to reproduce it on EC2? I'm currently using a single-host cluster to work around the problem. |
Hey, looks like I have the exact same problem. I have three EC2 instances. Two of them can discover each other, the third one will fail with the "Unable to join the cluster using any of the peers" error shown above. After the third instance failed to join, the previous two instances will show etcd warning: heartbeat time out: 'instance3'. |
@yichen That sounds like discovery is actually working but there is a network partition for some reason or the etcd service isn't running on instance3. Can you try restarting instance3 and see if it comes back? |
Additional info: |
Hey guys, thanks for the quick response. This problem was resolved. Sorry, I am not entirely sure what the root cause was; here is the list of things I did:
|
Last night I had a strange network partition on EC2 which reproduced in this manner. I had three machines: L was the leader, and there were two followers, A and B. A was unable to join the cluster or curl any of L's endpoints. B was participating just fine. Then I set up ncat on L listening on another port and started |
I wasn't seeing partial failure as you've been seeing above. Sorry I haven't made time to try out the debug version. :(
|
@unihorn that branch (28) worked for me. Based on what @viliamjr said I tried adding in
to my vagrant config for vbox in case it was a routing error, but no good. Also I can manually hit the url outside of etcd. Your branch worked great though, any ideas a) when that will get released and b) when it will get incorporated into coreos? |
@bfosberry It is really weird that it only fails sometimes. |
Latest vagrant coreos image works! from https://github.com/coreos/coreos-vagrant/blob/master/Vagrantfile |
@bfosberry Great! :) :) |
I'm getting discovery failures and also SLOW discovery. The cluster has 5 machines: 2 in us-west and 3 in us-east, in 2 VPCs with a VPN between them. The networking is -solid-. I can ssh between all nodes and ping consistently at 1ms within a region and 60ms between regions. 4 of the 5 nodes have clustered. But the 5th node in us-east keeps giving this error... $etcdctl ls Of note: I created the 2 nodes in us-west first. They clustered immediately. Then I added a node in us-east. It took a long time to join the cluster, like 5-10 minutes, but it finally did. And again, the networking between all nodes and to the internet is fine. And I started with a new discovery token. |
Does etcd & its -discovery mode work on EC2; or is it just failing for me?
Security group for the EC2 VM it's running in: