Zookeeper Data
This document describes the information we keep in zookeeper, how it is generated, and how the python client uses it.
Each keyspace is now a global zookeeper path, with sub-directories for its shards and its action / actionlog. The keyspace node itself contains no data (it used to, but the vt tools never touched it).
$ zk ls /zk/global/vt/keyspaces/ruser
action
actionlog
shards
The path and sub-paths are created by 'vtctl CreateKeyspace'.
We use the action and actionlog paths for locking only, no process is actively watching these paths.
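For example, to create the keyspace used in this document (the exact flags accepted by vtctl may vary between versions):

$ vtctl CreateKeyspace ruser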
A shard is a global zookeeper path, with sub-directories for its action / actionlog, and replication graph.
$ zk ls /zk/global/vt/keyspaces/ruser/shards/10-20
action
actionlog
cell1-0000200278
We use the action and actionlog paths for locking only, no process is actively watching these paths.
A shard path is also a node with a Shard object:
type Shard struct {
    // There can be at most one master, but there may be none (zero value).
    MasterAlias TabletAlias

    // This must match the shard name based on our other conventions, but
    // it is helpful to have it decomposed here.
    KeyRange key.KeyRange

    // ServedTypes is a list of all the tablet types this shard will
    // serve. This is usually used with overlapping shards during
    // data shuffles like shard splitting.
    ServedTypes []TabletType

    // SourceShards is the list of shards we're replicating from,
    // using filtered replication.
    SourceShards []SourceShard

    // Cells is the list of cells that have tablets for this shard.
    // It is populated at InitTablet time when a tablet is added
    // in a cell that is not in the list yet.
    Cells []string
}
type SourceShard struct {
    // Uid is the unique ID for this SourceShard object.
    // It is for instance used as a unique index in blp_checkpoint
    // when storing the position. It should be unique within a
    // destination Shard, but not globally unique.
    Uid uint32
    // the source keyspace
    Keyspace string
    // the source shard
    Shard string
    // the source shard keyrange
    KeyRange key.KeyRange
    // we could add other filtering information, like a table list, ...
}
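As an illustration of how these fields fit together, here is a hypothetical destination shard during a split of the 10-20 range, written with the types above. The values (alias, uid, cells) are made up for this sketch:

// Hypothetical: destination shard 10-18, created while splitting the
// source shard 10-20 and still replicating from it via filtered replication.
dest := Shard{
    MasterAlias: TabletAlias{Cell: "nyc", Uid: 200300}, // made-up alias
    KeyRange:    key.KeyRange{Start: "10", End: "18"},
    // TabletType is assumed to be a string type here.
    ServedTypes: []TabletType{"master", "replica", "rdonly"},
    SourceShards: []SourceShard{{
        Uid:      1, // only needs to be unique within this destination shard
        Keyspace: "ruser",
        Shard:    "10-20",
        KeyRange: key.KeyRange{Start: "10", End: "20"},
    }},
    Cells: []string{"nyc"},
}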
$ zk cat /zk/global/vt/keyspaces/ruser/shards/10-20
{
  "MasterAlias": {
    "Cell": "nyc",
    "Uid": 200278
  },
  "KeyRange": {
    "Start": "10",
    "End": "20"
  },
  "Cells": [
    "cell1",
    "cell2"
  ]
}
The shard path and sub-directories are created when the first tablet in that shard is created.
The Shard object is changed when a tablet is added in a cell that is not yet in Cells, or when the master changes.
A tablet has a path in zookeeper, with its action / actionlog and pid file:
$ zk ls /zk/nyc/vt/tablets/0000200308
action
actionlog
pid
We use the action and actionlog paths for remote execution of actions. vttablet will watch that directory and launch a vtaction for every requested action.
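In spirit, the watch loop looks like the following sketch. It uses the github.com/samuel/go-zookeeper/zk client purely for illustration; the real vttablet code is different, and the vtaction flag shown is made up:

package main

import (
    "log"
    "os/exec"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 5*time.Second)
    if err != nil {
        log.Fatal(err)
    }
    actionPath := "/zk/nyc/vt/tablets/0000200308/action"
    for {
        // Watch the action directory; each child node is a requested action.
        children, _, events, err := conn.ChildrenW(actionPath)
        if err != nil {
            log.Fatal(err)
        }
        for _, child := range children {
            // Launch a vtaction process for every queued action node.
            // (The flag name is hypothetical.)
            cmd := exec.Command("vtaction", "-action-node", actionPath+"/"+child)
            if err := cmd.Run(); err != nil {
                log.Printf("vtaction failed for %v: %v", child, err)
            }
        }
        <-events // block until the children change, then rescan
    }
}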
A tablet also has a node of type Tablet:
type Tablet struct {
    Cell string        // the zk cell this tablet is assigned to (doesn't change)
    Uid uint32         // the server id for this instance
    Parent TabletAlias // the globally unique alias for our replication parent - zero if this is the global master

    Addr string        // host:port for queryserver
    SecureAddr string  // host:port for queryserver using an encrypted connection
    MysqlAddr string   // host:port for the mysql instance
    MysqlIpAddr string // ip:port for the mysql instance - needed to match slaves with tablets and preferable to relying on reverse dns

    Keyspace string
    Shard string
    Type TabletType

    State TabletState

    // Normally the database name is implied by "vt_" + keyspace. I
    // really want to remove this but there are some databases that are
    // hard to rename.
    DbNameOverride string
    KeyRange key.KeyRange
}
$ zk cat /zk/nyc/vt/tablets/0000200308
{
  "Cell": "nyc",
  "Uid": 200308,
  "Parent": {
    "Cell": "",
    "Uid": 0
  },
  "Addr": "nyc-db308.nyc.youtube.com:8101",
  "SecureAddr": "nyc-db308.nyc.youtube.com:8102",
  "MysqlAddr": "nyc-db308.nyc.youtube.com:3306",
  "MysqlIpAddr": "10.33.158.138:3306",
  "Keyspace": "",
  "Shard": "",
  "Type": "idle",
  "State": "ReadOnly",
  "DbNameOverride": "",
  "KeyRange": {
    "Start": "",
    "End": ""
  }
}
The Tablet object is created by 'vtctl InitTablet'. Up-to-date information (port numbers, ...) is maintained by the vttablet process. 'vtctl ChangeSlaveType' will also change the Tablet record.
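For illustration, here is how a tool might read and atomically update a Tablet record, reusing the conn from the sketch above (the vt tools have their own wrappers, so this is only a sketch):

// Read the tablet node and decode the JSON into the Tablet struct above.
data, stat, err := conn.Get("/zk/nyc/vt/tablets/0000200308")
if err != nil {
    log.Fatal(err)
}
var t Tablet
if err := json.Unmarshal(data, &t); err != nil {
    log.Fatal(err)
}
t.State = "ReadWrite" // hypothetical change; assumes TabletState is a string type
newData, err := json.Marshal(&t)
if err != nil {
    log.Fatal(err)
}
// Passing the version from the read makes the write conditional: it fails
// if someone else (e.g. vttablet refreshing ports) changed the node meanwhile.
if _, err := conn.Set("/zk/nyc/vt/tablets/0000200308", newData, stat.Version); err != nil {
    log.Fatal(err)
}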
The data maintained by the vt tools is as follows:
- the master is stored in the Shard object
- the replication links are stored in each cell, one per master/slave relationship (in the cell of the slave), in a ShardReplication object
Whenever the replication graph changes, we rebuild the serving graph for that cell.
The serving graph for a shard is maintained in every cell that contains tablets for that shard. To get all the available keyspaces in a cell, just list the top-level cell serving graph directory:
$ zk ls /zk/nyc/vt/ns
lookup
user
The python client lists that directory at startup to find all the keyspaces.
The keyspace data is stored under /zk/<cell>/vt/ns/<keyspace>:
type SrvShard struct {
    // Copied from Shard
    KeyRange key.KeyRange
    ServedTypes []TabletType

    // TabletTypes represents the list of types we have serving tablets
    // for, in this cell only.
    TabletTypes []TabletType

    // For atomic updates
    version int64
}
type SrvKeyspace struct {
    // Shards to use per type, only contains complete partitions.
    Partitions map[TabletType]*KeyspacePartition

    // This list will be deprecated as soon as Partitions is used.
    // List of non-overlapping shards sorted by range.
    Shards []SrvShard

    // List of available tablet types for this keyspace in this cell.
    // May not have a server for every shard, but we have some.
    TabletTypes []TabletType

    // For atomic updates
    version int64
}
type KeyspacePartition struct {
    // List of non-overlapping continuous shards sorted by range.
    Shards []SrvShard
}
$ zk cat /zk/nyc/vt/ns/rlookup
{
  "Shards": [
    {
      "KeyRange": {
        "Start": "",
        "End": ""
      }
    }
  ],
  "TabletTypes": [
    "master",
    "rdonly",
    "replica"
  ]
}
The only way to build this data is to run the following vtctl command:
$ vtctl RebuildKeyspaceGraph <keyspace>
When building a new data center, this command should be run for every keyspace.
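For example, a hypothetical loop over all keyspaces, assuming 'zk ls' prints one child name per line:

$ for ks in $(zk ls /zk/global/vt/keyspaces); do vtctl RebuildKeyspaceGraph $ks; done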
Rebuilding a keyspace graph will:
- find all the shard names in the keyspace by looking at the children of /zk/global/vt/keyspaces/<keyspace>/shards
- rebuild the graph for every shard in the keyspace (see below)
- find the list of cells to update by collecting the cells for each tablet of the first shard
- compute, sanity check, and update the keyspace graph object in every cell

The python client reads these keyspace nodes to find the shard map (KeyRanges, TabletTypes, ...).
The shard data is stored under /zk/<cell>/vt/ns/<keyspace>/<shard>:
$ zk cat /zk/nyc/vt/ns/rlookup/0
{
  "KeyRange": {
    "Start": "",
    "End": ""
  }
}
We also have per-serving-type data under /zk/<cell>/vt/ns/<keyspace>/<shard>/<type>:
$ zk cat /zk/nyc/vt/ns/rlookup/0/master
{
  "entries": [
    {
      "uid": 200274,
      "host": "nyc-db274.nyc.youtube.com",
      "port": 0,
      "named_port_map": {
        "_mysql": 3306,
        "_vtocc": 8101,
        "_vts": 8102
      }
    }
  ]
}
The shard serving graph can be re-built using the 'vtctl RebuildShardGraph <keyspace>/<shard>' command. However, it is also triggered by any 'vtctl ChangeSlaveType' command, so it is usually not needed. For instance, when vt_bns_monitor takes servers in and out of serving state, it will rebuild the shard graph.
Note this will rebuild the serving graph for all cells, not just one cell.
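For instance, to rebuild the serving graph for the shard used earlier in this document:

$ vtctl RebuildShardGraph ruser/10-20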
Rebuilding a shard serving graph will:
- compute the data to write by looking at all the tablets in the replication graph
- write all the /zk/<cell>/vt/ns/<keyspace>/<shard>/<type> nodes everywhere
- delete any pre-existing /zk/<cell>/vt/ns/<keyspace>/<shard>/<type> node that is no longer in use
- compute and write all the /zk/<cell>/vt/ns/<keyspace>/<shard> nodes everywhere
The python client reads the per-type data nodes to find servers to talk to. When resolving user.10-20.master, it will try to read /zk/local/vt/ns/user/10-20/master.
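As a final sketch, here is what that resolution step could look like in Go. The entry/addrs structs below simply mirror the JSON shown above; they are assumptions for this example, not the real client types:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

// Mirrors the JSON stored in the per-type serving graph nodes.
type entry struct {
    Uid          uint32         `json:"uid"`
    Host         string         `json:"host"`
    Port         int            `json:"port"`
    NamedPortMap map[string]int `json:"named_port_map"`
}

type addrs struct {
    Entries []entry `json:"entries"`
}

func main() {
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 5*time.Second)
    if err != nil {
        log.Fatal(err)
    }
    // Resolve user.10-20.master by reading the per-type node directly.
    data, _, err := conn.Get("/zk/local/vt/ns/user/10-20/master")
    if err != nil {
        log.Fatal(err)
    }
    var a addrs
    if err := json.Unmarshal(data, &a); err != nil {
        log.Fatal(err)
    }
    e := a.Entries[0] // a real client would pick among (and retry) entries
    fmt.Printf("master endpoint: %s:%d\n", e.Host, e.NamedPortMap["_vtocc"])
}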