Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime error: integer divide by zero #186

Closed
bodiug opened this issue Nov 15, 2016 · 15 comments
Closed

runtime error: integer divide by zero #186

bodiug opened this issue Nov 15, 2016 · 15 comments
Labels
Milestone

Comments

@bodiug
Copy link

bodiug commented Nov 15, 2016

Hi,

I've run into a bug. When running a service consisting of around 100 containers this error does not occur, but scaling up to several 100's it does. Other services with different urlprefix still continue to work.

Nov 15 09:47:58 random-host fabio[471]: route add local-bla-app hello-world.domain.com/ http://192.168.10.55:8080/ tags "random-host-urlprefix-hello-world.domain.com/"
Nov 15 09:48:37 random-host fabio[471]: 2016/11/15 09:48:37 http: panic serving [::1]:46124: runtime error: integer divide by zero
Nov 15 09:48:37 random-host fabio[471]: goroutine 28 [running]:
Nov 15 09:48:37 random-host fabio[471]: net/http.(*conn).serve.func1(0xc421e0c400)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/go1.7.3/src/net/http/server.go:1491 +0x12a
Nov 15 09:48:37 random-host fabio[471]: panic(0x7d13c0, 0xc42000e070)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/go1.7.3/src/runtime/panic.go:458 +0x243
Nov 15 09:48:37 random-host fabio[471]: github.com/eBay/fabio/route.glob..func1(0x0, 0xc421d44101)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/gopath/src/github.com/eBay/fabio/route/picker.go:48 +0x92
Nov 15 09:48:37 random-host fabio[471]: github.com/eBay/fabio/route.rndPicker(0xc42004f680, 0x1)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/gopath/src/github.com/eBay/fabio/route/picker.go:30 +0x36
Nov 15 09:48:37 random-host fabio[471]: github.com/eBay/fabio/route.Table.lookup(0xc421d420c0, 0xc421e102b0, 0xf, 0xc421e10294, 0x1, 0x0, 0x0, 0x6563617254)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/gopath/src/github.com/eBay/fabio/route/table.go:281 +0x3ea
Nov 15 09:48:37 random-host fabio[471]: github.com/eBay/fabio/route.Table.Lookup(0xc421d420c0, 0xc4200b33b0, 0x0, 0x0, 0x0)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/gopath/src/github.com/eBay/fabio/route/table.go:253 +0x226
Nov 15 09:48:37 random-host fabio[471]: github.com/eBay/fabio/proxy.target(0xc4200b33b0, 0x7f5f2453b401)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/gopath/src/github.com/eBay/fabio/proxy/util.go:94 +0x8b
Nov 15 09:48:37 random-host fabio[471]: github.com/eBay/fabio/proxy.(*httpProxy).ServeHTTP(0xc42019e0f0, 0xa1d900, 0xc420165040, 0xc4200b33b0)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/gopath/src/github.com/eBay/fabio/proxy/proxy.go:35 +0x9c
Nov 15 09:48:37 random-host fabio[471]: net/http.serverHandler.ServeHTTP(0xc420070100, 0xa1d900, 0xc420165040, 0xc4200b33b0)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/go1.7.3/src/net/http/server.go:2202 +0x7d
Nov 15 09:48:37 random-host fabio[471]: net/http.(*conn).serve(0xc421e0c400, 0xa1e100, 0xc420c847c0)
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/go1.7.3/src/net/http/server.go:1579 +0x4b7
Nov 15 09:48:37 random-host fabio[471]: created by net/http.(*Server).Serve
Nov 15 09:48:37 random-host fabio[471]: /Users/frschroeder/go1.7.3/src/net/http/server.go:2293 +0x44d
Nov 15 09:48:57 random-host fabio[471]: 2016/11/15 09:48:57 [INFO] consul: Health changed to #107221
Nov 15 09:49:35 random-host fabio[471]: 2016/11/15 09:49:35 [INFO] consul: Health changed to #107231
Nov 15 09:49:42 random-host fabio[471]: 2016/11/15 09:49:42 [INFO] consul: Health changed to #107233
Nov 15 09:49:44 random-host fabio[471]: 2016/11/15 09:49:44 [INFO] consul: Health changed to #107235
Nov 15 09:49:44 random-host fabio[471]: 2016/11/15 09:49:44 [INFO] consul: Health changed to #107396
Nov 15 09:49:44 random-host fabio[471]: 2016/11/15 09:49:44 [INFO] Updated config to
Nov 15 09:49:44 random-host fabio[471]: route add local-bla-app hello-world.domain.com/ http://192.168.8.8:8080/ tags "random-host-urlprefix-hello-world.domain.com/"
@magiconair
Copy link
Contributor

On vacation right now but I'll have a look later. That looks like an edge case I've missed.

@bodiug
Copy link
Author

bodiug commented Nov 16, 2016

No worries, thanks for the heads-up. Enjoy the remaining of your vacation!

@magiconair
Copy link
Contributor

You've indeed found a bug and it is in code for which I don't even have a test... How embarrassing.

What happens is that the weighing algorithm which pre-computes the distribution of the target URLs based on weight allocates only 100 slots (= 100%) and when you create more than a couple hundred instances the rounding errors ensure that no instance is used. This then trips the random number function which never thought it would be called with zero. Why would anybody create more than 100 instances anyway? ;)

Time to refactor. Not a biggie but I need to add a test first. For now stay <= 100 instances per prefix.

I've got an ugly patch which you can apply if this is an urgent production issue. Otherwise, a fix is coming soon.

Thanks for finding this.

@bodiug
Copy link
Author

bodiug commented Nov 17, 2016

Thanks for diving into this, no ugly patching needed. :) This was found during testing 100+ nodes. But don't hesitate to ping me if I can help testing.

@killcity
Copy link

Thanks for posting this bug :) I'm going to be rolling out a single urlprefix to more than 100 nodes, relatively soon (production).

@killcity
Copy link

@magiconair Any idea when this might be resolved? I know you've been all over the place :) We will be rolling out to a couple hundred nodes sometime in Jan/Feb timeframe.

Cheers

@magiconair
Copy link
Contributor

magiconair commented Nov 30, 2016 via email

@magiconair magiconair added the bug label Nov 30, 2016
@magiconair
Copy link
Contributor

@killcity I'll have a look today whether there is a quick workaround. My guess is that changing 100 to 1000 or 10000 in https://github.com/eBay/fabio/blob/master/route/route.go#L222 would do the trick but I need to write a small test to verify. If that works then I could push this first and then deliver a real fix later.

@magiconair
Copy link
Contributor

@killcity Unfortunately, the really quick fix doesn't work. On the upside, I've found that I actually do have a test for the code. Just in a place where I didn't expect it :) I'll keep looking.

@killcity
Copy link

killcity commented Dec 1, 2016

@magiconair Thanks!

magiconair added a commit that referenced this issue Dec 5, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently leads to a division by zero error
in a code path which does not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
magiconair added a commit that referenced this issue Dec 5, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently lead to a division by zero error
in a code path which did not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
magiconair added a commit that referenced this issue Dec 5, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently lead to a division by zero error
in a code path which did not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
@magiconair
Copy link
Contributor

@killcity I've pushed a change which should fix the issue and which also adds a fast-path for the default case of all targets receiving equal amounts of traffic. The new code guarantees that a target with a computed weight > 0 will receive traffic eventually if you use the rr scheduler. For the default rnd scheduler this guarantee cannot be made. This should allow an unlimited number of targets per route. Would be great if you could test this.

@magiconair
Copy link
Contributor

tests fail on travis but work locally. I'll have a look tomorrow.

magiconair added a commit that referenced this issue Dec 5, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently lead to a division by zero error
in a code path which did not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
@magiconair
Copy link
Contributor

Tests are timing out. Parsing 2500 routes is somewhat slow. Need to check whether the new parser will be faster.

magiconair added a commit that referenced this issue Dec 5, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently lead to a division by zero error
in a code path which did not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
magiconair added a commit that referenced this issue Dec 5, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently lead to a division by zero error
in a code path which did not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
@magiconair
Copy link
Contributor

Test added a nice stress test for url.String() :) Fixed now (hopefully).

magiconair added a commit that referenced this issue Dec 6, 2016
When fabio is presented with more than 200 targets for a single
route then a rounding error had the effect that no target would
get traffic which subsequently lead to a division by zero error
in a code path which did not expect an empty list of targets.

This change modifies the distribution algorithm as follows:

* The case where all targets receive an equal amount of traffic
has now a fast path which just returns the list of targets.

* For the other case a larger ring of 10.000 slots is always
used to achieve a certain amount of accuracy. In addition, all
targets that have a non-zero weight will use at least one slot
to work around rounding issues and the fact that some servers
would not receive traffic although they are alive and healthy.

This should guarantee an unlimited number of instances for a
single route.
@magiconair magiconair added this to the 1.4 milestone Dec 6, 2016
@magiconair
Copy link
Contributor

Merged to master

@magiconair magiconair modified the milestones: 1.3.6, 1.4 Jan 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants