
storage: use sensible zone configs for critical system and time series ranges #14990

Closed
a-robinson opened this issue Apr 17, 2017 · 28 comments

@a-robinson
Contributor

While it makes sense for small clusters, our current default zone configuration, which creates only 3 replicas of all data, is somewhat risky for large clusters, where it may be preferable to keep more than 3 replicas of critical system ranges.

This can be addressed via documentation for 1.0 (cockroachdb/docs#1280, cockroachdb/docs#1248), but before 1.1 we should do some testing with different configurations and consider configuring more replicas of the system ranges by default.

@a-robinson a-robinson added this to the 1.1 milestone Apr 17, 2017
@tbg
Member

tbg commented Jun 16, 2017

A sensible choice would be to replicate these ranges with a factor of max(min(nodes_in_cluster, 5), highest_replication_factor_for_any_zone). By that token, you could be assured that if you run a range with, say, x11 replication, you can actually tolerate 5 dead replicas (and not just whatever the critical system ranges top out at).
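
A minimal sketch of that rule in Go (the function name and signature are illustrative only, not CockroachDB code):

```go
package main

import "fmt"

// systemRangeReplicationFactor sketches the proposal above:
// max(min(nodes_in_cluster, 5), highest_replication_factor_for_any_zone).
func systemRangeReplicationFactor(nodesInCluster, highestZoneReplicationFactor int) int {
	factor := nodesInCluster
	if factor > 5 {
		factor = 5
	}
	if highestZoneReplicationFactor > factor {
		factor = highestZoneReplicationFactor
	}
	return factor
}

func main() {
	fmt.Println(systemRangeReplicationFactor(3, 3))   // 3: a small cluster stays at 3x
	fmt.Println(systemRangeReplicationFactor(9, 3))   // 5: a larger cluster is capped at 5x
	fmt.Println(systemRangeReplicationFactor(20, 11)) // 11: matches the x11 zone, so 5 dead replicas stay survivable
}
```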

@tbg
Member

tbg commented Jun 16, 2017

via @bdarnell:

yeah. we should eventually move to more fine-grained defaults for different system sub-ranges (high replication factor and long ttl for range metadata, high replication and low ttl for liveness, low replication for timeseries)

@tbg tbg changed the title storage: Replicate critical system ranges more than 3 ways by default storage: use sensible zone configs for critical system and time series ranges Jun 16, 2017
@bdarnell
Contributor

The .system special zone currently includes too much. It contains both the liveness span (which wants a short TTL) and the system descriptor/namespace tables (which need a TTL at least as high as the TTL of any other table, since time travel queries are limited to the shorter of the TTLs of the table itself and the table descriptor tables).
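
A small sketch of that constraint (hypothetical names and values, not CockroachDB's actual zone config structs): the usable time-travel window for a table is the shorter of the table's own GC TTL and the descriptor/namespace tables' GC TTL, so the latter has to be at least as large as the largest table TTL, while the liveness span can use a much shorter TTL independently.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical GC TTLs.
	descriptorTTL := 25 * time.Hour // system descriptor/namespace tables
	tableTTLs := map[string]time.Duration{
		"orders": 25 * time.Hour,
		"audit":  90 * 24 * time.Hour, // a table configured to keep long history
	}

	for name, ttl := range tableTTLs {
		// Time travel on a table is limited to the shorter of the two TTLs,
		// so "audit" only gets 25h of history unless descriptorTTL is raised too.
		window := ttl
		if descriptorTTL < window {
			window = descriptorTTL
		}
		fmt.Printf("%s: table TTL %s, usable time-travel window %s\n", name, ttl, window)
	}
}
```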

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 14, 2017
Default the .meta zone config to 5 replicas and 1h GC TTL. The higher
replication reflects the relative danger of significant data loss and
unavailability for the meta ranges. The shorter GC TTL reflects the lack
of need for ever performing historical queries on these ranges coupled
with the desire to keep the meta ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
@cuongdo
Contributor

cuongdo commented Aug 22, 2017

@petermattis this seems like a risky change at this point; should this move to 1.2?

@bdarnell
Contributor

Yes, we already agreed to move this to 1.2 in #17628

@bdarnell bdarnell modified the milestones: 1.2, 1.1 Aug 22, 2017
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 10, 2017
Default the .meta zone config to 1h GC TTL. The shorter GC TTL reflects
the lack of need for ever performing historical queries on these ranges
coupled with the desire to keep the meta ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 11, 2017
Default the .meta zone config to 1h GC TTL and default the .liveness
zone config to 1m GC TTL. The shorter GC TTLs reflect the lack of need
for ever performing historical queries on these ranges coupled with the
desire to keep the meta and liveness ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 13, 2017
Default the .meta zone config to 1h GC TTL and default the .liveness
zone config to 10m GC TTL. The shorter GC TTLs reflect the lack of need
for ever performing historical queries on these ranges coupled with the
desire to keep the meta and liveness ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
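
Reading these commits together with the earlier one (5 replicas plus a 1h GC TTL for .meta), the intended split looks roughly like the sketch below. The struct is hypothetical, and the .liveness and .timeseries replication factors are assumed from the discussion above rather than taken from shipped code.

```go
package main

import (
	"fmt"
	"time"
)

// zonePolicy is a hypothetical stand-in for a zone config: a replication
// factor plus the GC TTL that bounds historical reads and range size.
type zonePolicy struct {
	replicas int
	gcTTL    time.Duration
}

func main() {
	defaults := map[string]zonePolicy{
		".meta":       {replicas: 5, gcTTL: time.Hour},        // replicas and TTL from the commits above
		".liveness":   {replicas: 5, gcTTL: 10 * time.Minute}, // TTL from the commits; replicas assumed
		".timeseries": {replicas: 3, gcTTL: 25 * time.Hour},   // lower replication per the discussion; TTL assumed
	}
	for zone, p := range defaults {
		fmt.Printf("%-12s %dx replication, GC TTL %s\n", zone, p.replicas, p.gcTTL)
	}
}
```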
@tbg tbg added this to the 2.1 milestone Jul 19, 2018
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Jul 22, 2018
@tbg
Member

tbg commented Aug 16, 2018

I added a docs-todo for a known limitation about this, see #14990.

@tbg
Member

tbg commented Aug 21, 2018

Filed #28901 to track cascading zone configs.

@nstewart nstewart added S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Sep 18, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 20, 2018
@petermattis
Collaborator

@m-schneider Is there anything left to do here?

@m-schneider
Contributor

No, closing.

@tbg
Member

tbg commented Oct 8, 2018

I'm a bit confused. I assume that #27349 is the PR that closes this issue (though it didn't reference it prior to this commit).

That PR is sparse on description, but from the code it looks like if I set up a new five-node cluster, I should get my critical system ranges replicated 5x.

That doesn't seem to be the case:

root@:26257/> select * from crdb_internal.ranges;
  range_id |     start_key     |         start_pretty          |      end_key      |          end_pretty           | database |      table       | index | replicas | lease_holder
+----------+-------------------+-------------------------------+-------------------+-------------------------------+----------+------------------+-------+----------+--------------+
         1 |                   | /Min                          | \004              | /System/""                    |          |                  |       | {1,4,5}  |            1
         2 | \004              | /System/""                    | \004\000liveness- | /System/NodeLiveness          |          |                  |       | {1,3,2}  |            1
         3 | \004\000liveness- | /System/NodeLiveness          | \004\000liveness. | /System/NodeLivenessMax       |          |                  |       | {1,3,5}  |            3
         4 | \004\000liveness. | /System/NodeLivenessMax       | \004tsd           | /System/tsd                   |          |                  |       | {4,5,3}  |            4
         5 | \004tsd           | /System/tsd                   | \004tse           | /System/"tse"                 |          |                  |       | {4,5,3}  |            5
         6 | \004tse           | /System/"tse"                 | \210              | /Table/SystemConfigSpan/Start |          |                  |       | {1,3,2}  |            1
         7 | \210              | /Table/SystemConfigSpan/Start | \223              | /Table/11                     |          |                  |       | {1,3,2}  |            1
         8 | \223              | /Table/11                     | \224              | /Table/12                     | system   | lease            |       | {1,5,2}  |            2
         9 | \224              | /Table/12                     | \225              | /Table/13                     | system   | eventlog         |       | {4,3,2}  |            2
        10 | \225              | /Table/13                     | \226              | /Table/14                     | system   | rangelog         |       | {5,4,2}  |            4
        11 | \226              | /Table/14                     | \227              | /Table/15                     | system   | ui               |       | {1,2,3}  |            1
        12 | \227              | /Table/15                     | \230              | /Table/16                     | system   | jobs             |       | {5,4,2}  |            5
        13 | \230              | /Table/16                     | \231              | /Table/17                     |          |                  |       | {1,3,4}  |            3
        14 | \231              | /Table/17                     | \232              | /Table/18                     |          |                  |       | {4,5,2}  |            5
        15 | \232              | /Table/18                     | \233              | /Table/19                     |          |                  |       | {1,2,4}  |            2
        16 | \233              | /Table/19                     | \234              | /Table/20                     | system   | web_sessions     |       | {1,5,3}  |            3
        17 | \234              | /Table/20                     | \235              | /Table/21                     | system   | table_statistics |       | {1,5,2}  |            2
        18 | \235              | /Table/21                     | \236              | /Table/22                     | system   | locations        |       | {1,3,2}  |            1
        19 | \236              | /Table/22                     | \237              | /Table/23                     |          |                  |       | {4,5,2}  |            2
        20 | \237              | /Table/23                     | \377\377          | /Max                          | system   | role_members     |       | {4,3,5}  |            5
(20 rows)

Time: 29.585ms

warning: no current database set. Use SET database = <dbname> to change, CREATE DATABASE to make a new database.
root@:26257/> select * from crdb_internal.gossip_liveness;
  node_id | epoch |       expiration       | draining | decommissioning
+---------+-------+------------------------+----------+-----------------+
        1 |     1 | 1538992670.035649687,0 |  false   |      false
        2 |     1 | 1538992670.740451009,0 |  false   |      false
        3 |     1 | 1538992671.575627883,0 |  false   |      false
        4 |     1 | 1538992672.506993000,0 |  false   |      false
        5 |     1 | 1538992673.226969883,0 |  false   |      false
(5 rows)

@m-schneider are my assumptions wrong? What should happen in my situation?

@tbg tbg reopened this Oct 8, 2018
@m-schneider
Contributor

Taking a look.

@m-schneider
Contributor

How did you start up the cluster? I tried to reproduce on master, but I got the expected behavior:

root@:26257/defaultdb> select * from crdb_internal.ranges;
  range_id |     start_key     |         start_pretty          |      end_key      |          end_pretty           | database_name |    table_name    | index_name |  replicas   | lease_holder
+----------+-------------------+-------------------------------+-------------------+-------------------------------+---------------+------------------+------------+-------------+--------------+
         1 |                   | /Min                          | \004              | /System/""                    |               |                  |            | {1,2,3,4,5} |            1  
         2 | \004              | /System/""                    | \004\000liveness- | /System/NodeLiveness          |               |                  |            | {1,2,3,4,5} |            1  
         3 | \004\000liveness- | /System/NodeLiveness          | \004\000liveness. | /System/NodeLivenessMax       |               |                  |            | {1,2,3,4,5} |            1  
         4 | \004\000liveness. | /System/NodeLivenessMax       | \004tsd           | /System/tsd                   |               |                  |            | {1,2,3,4,5} |            1  
         5 | \004tsd           | /System/tsd                   | \004tse           | /System/"tse"                 |               |                  |            | {1,2,5}     |            1  
         6 | \004tse           | /System/"tse"                 | \210              | /Table/SystemConfigSpan/Start |               |                  |            | {1,2,3,4,5} |            4  
         7 | \210              | /Table/SystemConfigSpan/Start | \223              | /Table/11                     |               |                  |            | {1,2,3,4,5} |            1  
         8 | \223              | /Table/11                     | \224              | /Table/12                     | system        | lease            |            | {1,2,3,4,5} |            3  
         9 | \224              | /Table/12                     | \225              | /Table/13                     | system        | eventlog         |            | {1,2,3,4,5} |            1  
        10 | \225              | /Table/13                     | \226              | /Table/14                     | system        | rangelog         |            | {1,2,3,4,5} |            1  
        11 | \226              | /Table/14                     | \227              | /Table/15                     | system        | ui               |            | {1,2,3,4,5} |            5  
        12 | \227              | /Table/15                     | \230              | /Table/16                     | system        | jobs             |            | {1,2,3,4,5} |            2  
        13 | \230              | /Table/16                     | \231              | /Table/17                     |               |                  |            | {1,2,3,4,5} |            1  
        14 | \231              | /Table/17                     | \232              | /Table/18                     |               |                  |            | {1,2,3,4,5} |            1  
        15 | \232              | /Table/18                     | \233              | /Table/19                     |               |                  |            | {1,2,3,4,5} |            4  
        16 | \233              | /Table/19                     | \234              | /Table/20                     | system        | web_sessions     |            | {1,2,3,4,5} |            2  
        17 | \234              | /Table/20                     | \235              | /Table/21                     | system        | table_statistics |            | {1,2,3,4,5} |            1  
        18 | \235              | /Table/21                     | \236              | /Table/22                     | system        | locations        |            | {1,2,3,4,5} |            1  
        19 | \236              | /Table/22                     | \237              | /Table/23                     |               |                  |            | {1,2,3,4,5} |            1  
        20 | \237              | /Table/23                     | \377\377          | /Max                          | system        | role_members     |            | {1,2,3,4,5} |            1  
(20 rows)

Time: 158.343155ms

root@:26257/defaultdb> select * from crdb_internal.gossip_liveness;
  node_id | epoch |       expiration       | draining | decommissioning |            updated_at             
+---------+-------+------------------------+----------+-----------------+----------------------------------+
        1 |     1 | 1539012150.769984390,0 |  false   |      false      | 2018-10-08 15:22:21.273795+00:00  
        2 |     1 | 1539012150.863685273,0 |  false   |      false      | 2018-10-08 15:22:21.369532+00:00  
        3 |     1 | 1539012150.852281022,0 |  false   |      false      | 2018-10-08 15:22:21.361692+00:00  
        4 |     1 | 1539012150.877309202,0 |  false   |      false      | 2018-10-08 15:22:21.382665+00:00  
        5 |     1 | 1539012150.893886556,0 |  false   |      false      | 2018-10-08 15:22:21.399011+00:00  

@tbg
Member

tbg commented Oct 8, 2018

I used roachdemo and clicked the add button in the UI five times :-)

@tbg
Member

tbg commented Oct 8, 2018

Let me try that again... it'd be puzzling if that made a difference.

@tbg
Member

tbg commented Oct 8, 2018

Hrm. Unfortunately, it behaves as advertised now. I wonder what the difference was? How's our test coverage for this?

@m-schneider
Contributor

We have pretty extensive testing in allocator_test.go for various combinations of available vs alive nodes.

@a-robinson
Contributor Author

That doesn't necessarily mean that it always works end-to-end, though.

@tschottdorf do you know which version of cockroach you were on and whether roachdemo was resuming a preexisting cluster or if it initialized a new one? If the cluster was initialized before #27349 / #30480 then you wouldn't see the new behavior regardless of which version you were using.

@m-schneider
Contributor

I haven't been able to reproduce after a couple of attempts, including with roachdemo. Should we close for now? If you see this again, can you please run:
select * from crdb_internal.kv_store_status;

@tbg
Member

tbg commented Oct 10, 2018

@a-robinson I looked after reading your comment two days ago, but the buffer had been lost. I also gave this a few more rounds, but it just worked. Still very weird. I run make build pretty regularly, though perhaps I accidentally used a stray cockroach binary from a release branch.
Sorry about the noise!

@tbg tbg closed this as completed Oct 10, 2018