clustering: influxdb 0.9.0-rc23 panics when doing a GET with merge_metrics in a 3 node cluster #2272

Closed
agreentree opened this issue Apr 13, 2015 · 15 comments · Fixed by #2336
@agreentree

influxdb.log: panic: distributed queries not implemented yet and there are too many shards in this group
the command I use to get the panic is: monasca measurement-list cpu.idle_perc 1970 --merge_metrics

@agreentree
Author

I understand if this is a feature that hasn't been implemented yet, but the same commands worked with RC19, so we have written many regression tests with merge_metrics in the GET that now fail.

@otoolep
Contributor

otoolep commented Apr 13, 2015

OK, this may be a regression since we did change the query engine.

Can you supply a sequence of curl commands (writing data, then reading it) which brings out this issue?

https://github.com/influxdb/influxdb/blob/master/CONTRIBUTING.md#bug-reports

@agreentree
Author

this is the curl command I use to reproduce it:
curl -i -X GET -H 'X-Auth-User: mini-mon' -H 'X-Auth-Token: 46968286c92a423da87b2eae570c422a' -H 'X-Auth-Key: password' -H 'Accept: application/json' -H 'User-Agent: python-monascaclient' -H 'Content-Type: application/json' --cacert /usr/local/share/ca-certificates/monasca_test_ca.crt 'https://mon-ae1test-monasca01.useast.hpcloud.net:8080/v2.0/metrics/measurements?start_time=1970&merge_metrics=True&name=cpu.idle_perc'

@agreentree
Author

I just reproduced it with influxdb stand-alone as follows:

  • installed rc23 on 3 nodes, modified the join-urls on the 2 worker nodes to point to the master
  • ran the following curl commands (from the influxdb 0.9 doc. page):

curl -G 'http://localhost:8086/query' --data-urlencode "db=mydb" --data-urlencode "q=SELECT value FROM cpu_load_short WHERE region='us-west'"

curl -XPOST 'http://localhost:8086/write' -d ' {
"database": "mydb",
"retentionPolicy": "default",
"points": [
{
"name": "cpu_load_short",
"tags": {
"host": "server01",
"region": "us-west"
},
"timestamp": "2009-11-10T23:00:00Z",
"fields": {
"value": 0.64
}
}
]
}
'

curl -G 'http://localhost:8086/query' --data-urlencode "db=mydb" --data-urlencode "q=SELECT value FROM cpu_load_short WHERE region='us-west'"

influxdb fails with panic message from original comment

@jwilder
Contributor

jwilder commented Apr 13, 2015

From: https://github.com/influxdb/influxdb/blob/master/tx.go#L144, it looks like this is triggered because the replication factor is less than the number of servers in the cluster. I believe the default replication factor is 1 for the default retention policy and you are using a 3 node cluster. You might try creating a retention policy w/ a replication factor of 3 and specifying that RP in your writes as a work around.

@otoolep is this panic still needed though?
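
A rough way to see why the replication factor matters is sketched below. It assumes that each shard group is split into (number of data nodes / replication factor) shards, using integer division with a minimum of 1, and that the panic fires whenever a group holds more than one shard. Both rules are assumptions inferred from the panic message and the behaviour reported in this issue, not a copy of the code in tx.go; `shards_per_group` and `would_panic` are hypothetical names.

```python
def shards_per_group(data_nodes: int, replica_n: int) -> int:
    """Assumed rule: one shard per replica_n data nodes, never fewer than one."""
    if data_nodes < 1 or replica_n < 1:
        raise ValueError("data_nodes and replica_n must be positive")
    return max(1, data_nodes // replica_n)


def would_panic(data_nodes: int, replica_n: int) -> bool:
    """Assumed trigger: the query path only handles a single shard per group."""
    return shards_per_group(data_nodes, replica_n) > 1


if __name__ == "__main__":
    # Walk through the replication factors discussed for a 3 node cluster.
    for replica_n in (1, 2, 3):
        print(replica_n, shards_per_group(3, replica_n), would_panic(3, replica_n))
```

Under these assumptions a 3 node cluster panics with replication 1 and does not panic with replication 2 or 3, which is consistent with the reports in this issue.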

@otoolep
Contributor

otoolep commented Apr 13, 2015

Yeah, that could be an issue. Let me look into it, we may not need the explicit panic any longer.

@agreentree
Author

I resolved the issue in my environment by adding DEFAULT to the database creation so that the correct retention policy and replication factor get set. No opinion on whether that panic should still occur if the replication factor doesn't match the clustered env; perhaps a more informative message would help.

@otoolep
Contributor

otoolep commented Apr 14, 2015

@beckettsean beckettsean added this to the 0.9.0 milestone Apr 14, 2015
@agreentree
Author

this works for me now

@svscorp

svscorp commented Apr 17, 2015

@beckettsean @levicook

Facing the same issue on rc25. 3 server setup, all 3 are both broker and data-node.

Then I wrote some data:

curl -XPOST 'http://influxdb:8086/write' -d '
{
    "database": "ilia",
    "retentionPolicy": "default",
    "points": [
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:00:00Z",
            "fields": {
                "value": 10.64
            }
        },
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:05:00Z",
            "fields": {
                "value": 20.00
            }
        },
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:10:00Z",
            "fields": {
                "value": 25.00
            }
        },
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:15:00Z",
            "fields": {
                "value": 35.01
            }
        }
    ]
}'

Then I run

curl -G 'http://influxdb:8086/query?pretty=true' --data-urlencode "q=SELECT * FROM cpu" --data-urlencode "db=ilia"

And it crashed the node.

Then I found this issue. I created the policy with replicaN = 2 and it started to work. But that's weak: the default retention policy has replicaN = 1 (which is not more than the number of servers in the cluster). Why is it running into the panic?

Also, in <0.9.0 there was an option to set the default replication factor, and in 0.9.0 there is not.

@agreentree agreentree reopened this Apr 17, 2015
@agreentree
Author

even though this works when the replication factor matches the number of nodes in the cluster, it sounds like there is still a question around whether the panic should occur if the replication factor is less than the number of clustered nodes.

@jwilder jwilder self-assigned this Apr 17, 2015
@jwilder jwilder closed this as completed Apr 17, 2015
@jwilder jwilder reopened this Apr 17, 2015
jwilder added a commit that referenced this issue Apr 17, 2015
Fixes #2272

There was previously an explicit panic put in the query engine to prevent
queries where the number of shards was not equal to the number of data nodes
in the cluster.  This was waiting for the distributed queries branch to land
but was not removed when that landed.

There may be a more efficient way to fix this but this fix simply queries
all the shards and merges their outputs.  Previously, the code assumed that
only one shard would be hit.  Querying multiple shards ended up producing
duplicate values during the map phase so the map output needed to be merged
as opposed to appended to avoid the dups.
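
To illustrate the difference between appending and merging map outputs, here is a small sketch in Python (the actual fix is in Go and is not reproduced here). It assumes each shard returns its rows as time-sorted `(timestamp, value)` tuples and that a duplicate is a row identical to one already emitted; `append_outputs` and `merge_outputs` are hypothetical names.

```python
import heapq


def append_outputs(shard_outputs):
    """Naive combination: concatenate every shard's rows, duplicates included."""
    combined = []
    for rows in shard_outputs:
        combined.extend(rows)
    return combined


def merge_outputs(shard_outputs):
    """Merge time-sorted rows from all shards and drop rows that repeat."""
    merged = []
    # heapq.merge needs each input sorted; equal rows come out adjacent.
    for row in heapq.merge(*shard_outputs):
        if not merged or merged[-1] != row:
            merged.append(row)
    return merged


if __name__ == "__main__":
    shard_a = [("2015-04-10T15:00:00Z", 10.64), ("2015-04-10T15:05:00Z", 20.0)]
    shard_b = [("2015-04-10T15:05:00Z", 20.0), ("2015-04-10T15:10:00Z", 25.0)]
    print(len(append_outputs([shard_a, shard_b])))  # the 15:05 row is counted twice
    print(merge_outputs([shard_a, shard_b]))        # each row appears once
```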
jwilder added a commit that referenced this issue Apr 18, 2015
@svscorp

svscorp commented Apr 18, 2015

@jwilder how do I configure the default number of shards?

jwilder added a commit that referenced this issue Apr 19, 2015
@jwilder
Contributor

jwilder commented Apr 19, 2015

@svscorp You should be able to change the replication factor using the CLI:

$ influx
> alter retention policy "default" on "mydb" replication 3 default

You could also create a new retention policy and mark it as default:

$ influx
> create retention policy "myrp" on "mydb" duration 1d replication 2 default
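
If the statement has to be run for every new database, it may be convenient to generate it from a script. The sketch below only builds the `alter retention policy` statement text shown above and a URL for the `/query` endpoint used earlier in this issue. That the endpoint accepts retention policy statements in the `q` parameter is an assumption, as are the function names and the `localhost:8086` address.

```python
from urllib.parse import urlencode


def alter_replication_statement(database: str, replication: int,
                                policy: str = "default") -> str:
    """Build the statement in the same form as the CLI example."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return (f'alter retention policy "{policy}" on "{database}" '
            f'replication {replication} default')


def query_url(statement: str, base_url: str = "http://localhost:8086") -> str:
    """URL-encode the statement as the q parameter of /query (assumed usage)."""
    return f"{base_url}/query?{urlencode({'q': statement})}"


if __name__ == "__main__":
    stmt = alter_replication_statement("mydb", 3)
    print(stmt)
    print(query_url(stmt))
```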

jwilder added a commit that referenced this issue Apr 19, 2015
@svscorp

svscorp commented Apr 20, 2015

@jwilder I get it. My question was about some kind of configuration, the way it was done in <0.9. But that's okay; it means I need to execute the ALTER query after I create a database.

But I think it would be wise to have a configuration option for applying the default replication factor to all new databases created with the 'default' key. It's easy to lose track of that bit: you always need to make one more query to get the replication factor you need (or include it in the create db query).

What do you think?

@svscorp

svscorp commented Apr 20, 2015

Also, here it is hardcoded:

https://github.com/influxdb/influxdb/blob/master/server.go#L1104

Maybe it makes sense to define it as a constant or something (or, again, in the configuration)?

jwilder added a commit that referenced this issue Apr 20, 2015
jwilder added a commit that referenced this issue Apr 21, 2015
jwilder added a commit that referenced this issue Apr 21, 2015