Vinyl Architecture

Roman Tsisyk edited this page Jun 27, 2016 · 12 revisions

Vinyl is a new disk engine for Tarantool 1.7.

Design

Sophia's main storage unit is a Node.

Each Node represents a single file with an associated in-memory region index and two in-memory key indexes. The Node file consists of Branches.

Each Branch consists of a set of sorted Regions and Region Index.

A single Region holds keys with values. It has the same semantic meaning as a B-tree page, but is organized differently: it has no tree structure or internal page-to-page relationships, and thus no metadata overhead.

A Region Index is an ordered range of Regions with their min and max keys and an on-disk reference. Regions never overlap.

A Key Index is very similar to an LSM zero-level (memtable), but has a different key lifecycle: all modifications first go into the index and are held there until they are explicitly removed by the merger.

Before getting added to the in-memory Key Index, modifications are first written to the Write-Ahead Log.
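
The write path described above can be sketched in C. This is an illustrative model, not the real Sophia/Vinyl code: the WAL is a plain in-memory append-only array, the Key Index is a sorted array, and the names (`db_set`, `wal`, `idx`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 128

struct kv { uint64_t key; uint64_t value; uint64_t lsn; };

struct kv wal[MAX_KEYS];  /* append-only log records (stand-in for the on-disk WAL) */
size_t    wal_len;
struct kv idx[MAX_KEYS];  /* in-memory Key Index, kept sorted by key */
size_t    idx_len;

void db_set(uint64_t key, uint64_t value, uint64_t lsn)
{
    /* 1. durability first: append the record to the Write-Ahead Log */
    struct kv rec = { key, value, lsn };
    wal[wal_len++] = rec;

    /* 2. then make it visible: sorted insert into the Key Index;
     *    an existing key is overwritten in place */
    size_t i = 0;
    while (i < idx_len && idx[i].key < key)
        i++;
    if (i < idx_len && idx[i].key == key) {
        idx[i] = rec;
        return;
    }
    memmove(&idx[i + 1], &idx[i], (idx_len - i) * sizeof(idx[0]));
    idx[i] = rec;
    idx_len++;
}
```

After repeated `db_set` calls for the same key, the log holds every record while the index holds one entry per distinct key; the index is what the merger later drains into a Branch.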

Lifecycle

The database lifecycle is organized around two major operations: Branch and Compaction.

When a Node's in-memory Key Index reaches a watermark size or the global memory limit, a Branch operation is scheduled.

When a Node's branch counter reaches a watermark value, a Compaction operation is scheduled.

Branch operation

  1. rotate the in-memory Key Index (switch to the second, empty index; Node updates arriving during the Branch go there)
  2. create new Regions and a Region Index from the frozen Key Index
  3. create a new Node Branch
  4. sequentially write the Branch to the end of the Node file
  5. sync
  6. free the frozen index and complete the rotation
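
Steps 1 and 6 can be sketched as a two-index rotation. Everything here (`struct node`, `node_rotate`, `node_gc`) is a hypothetical, minimal model of the double-buffering scheme, not the actual engine code.

```c
#include <assert.h>
#include <stddef.h>

struct key_index { size_t count; };   /* stand-in for an in-memory Key Index */

struct node {
    struct key_index index[2];        /* two Key Indexes per Node */
    int active;                       /* which index receives updates */
};

/* step 1: switch updates to the empty index; return the frozen one,
 * which the Branch operation will write to disk */
struct key_index *node_rotate(struct node *n)
{
    struct key_index *frozen = &n->index[n->active];
    n->active ^= 1;
    assert(n->index[n->active].count == 0);  /* rotation target must be empty */
    return frozen;
}

/* step 6: after the Branch is written and synced, free the frozen index */
void node_gc(struct key_index *frozen)
{
    frozen->count = 0;
}
```

Updates that arrive while the Branch is in flight land in the newly activated index, so the frozen one can be written out without blocking writers.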

Compaction operation

  1. sequentially read the Node file into memory
  2. create an iterator for each Branch
  3. merge and split the key stream
  4. create one or more new Nodes, or delete the Node entirely
  5. sequentially write the new Node or Nodes
  6. sync
  7. redistribute online updates between the new Nodes
  8. remove the old Node
  9. rename the new Node or Nodes when completed
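
Steps 2 and 3 amount to a k-way merge of per-Branch sorted iterators. The sketch below is a hedged model with hypothetical names: on duplicate keys it keeps the record with the highest LSN, which is also where older MVCC versions get dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct rec { uint64_t key; uint64_t lsn; };

struct branch_iter {
    const struct rec *recs;   /* sorted records of one Branch */
    size_t pos, len;
};

/* merge `n` sorted branch iterators into `out`; returns records written */
size_t merge(struct branch_iter *it, size_t n,
             struct rec *out, size_t out_cap)
{
    size_t out_len = 0;
    for (;;) {
        /* pick the smallest key; on a tie, the highest LSN wins */
        const struct rec *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (it[i].pos == it[i].len)
                continue;
            const struct rec *r = &it[i].recs[it[i].pos];
            if (!best || r->key < best->key ||
                (r->key == best->key && r->lsn > best->lsn))
                best = r;
        }
        if (!best)
            return out_len;   /* all iterators exhausted */
        /* emit the winner, then skip duplicates of that key everywhere */
        assert(out_len < out_cap);
        out[out_len++] = *best;
        for (size_t i = 0; i < n; i++)
            while (it[i].pos < it[i].len &&
                   it[i].recs[it[i].pos].key == best->key)
                it[i].pos++;
    }
}
```

Splitting the resulting stream into one or more new Nodes is then just a matter of cutting `out` at size boundaries.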

Optimization Game

All background operations are planned by a special scheduler.

There is a game between available memory, the number of Branches, and search time.

Each additional Branch means a possible additional disk access during a search. During a search, only Branch Regions that satisfy min <= key <= max are examined. At the same time, it is impossible to maintain memory limits without branching, because compaction takes longer than the possible rate of incoming data.
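
Since Regions are ordered and never overlap, the check above reduces to a binary search over [min, max] ranges. A minimal sketch (the `region_ref` and `region_lookup` names are assumptions):

```c
#include <assert.h>
#include <stdint.h>

struct region_ref { uint64_t min, max; };   /* one entry of a Region Index */

/* returns the index of the Region with min <= key <= max, or -1 */
int region_lookup(const struct region_ref *r, int n, uint64_t key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (key < r[mid].min)
            hi = mid - 1;
        else if (key > r[mid].max)
            lo = mid + 1;
        else
            return mid;          /* min <= key <= max */
    }
    return -1;                   /* key falls in a gap: no disk read needed */
}
```

A miss here means the Branch cannot contain the key at all, so its Region file is never touched.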

Sophia is designed to be read-optimized. There is a high probability that the most recently created Branches (hot data) are stored in the file system cache. The scheduler is aware of which Nodes have the largest in-memory Key Index and the biggest number of Branches; these are processed first.

Additionally, all operations are planned taking the current system state into account: memory usage statistics, the current load profile (redzone), operation priorities, checkpoint, backup, etc.

Sophia compaction is multi-threaded. Each worker (thread) requests a new task from the scheduler. The basic unit of a background task is an operation on a single Node.

Sophia is designed to efficiently utilize available memory. When more memory is available, branch/compaction operations become less frequent and the system becomes more disk-efficient. Best performance can be obtained with no memory limit set. Sophia is hard-drive (and flash) friendly, since all operations are delayed and executed as large sequential reads and writes, without overwrites.

Garbage Collection

Garbage collection (MVCC-related) is executed automatically by the Compaction task.

The scheduler also periodically checks for Nodes that store a large percentage of transactional versions (duplicates).

Algorithmic Complexity

Sophia has the following algorithmic complexity (in terms of disk accesses):

set worst case is O(1): one append-only key write to the Write-Ahead Log, plus an in-memory node index search and an in-memory index insert.

delete worst case is O(1) (a delete operation is equivalent to a set).

get worst case is amortized O(max_branch_count_per_node) random Region reads from a single Node file, each of which involves an in-memory key index search and an in-memory region search.

range worst case for a full database traversal is amortized O(total_region_count) Region reads, plus an in-memory key-index search for each Node.
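
The get bound above can be illustrated with a toy lookup path: check the in-memory Key Index first, then scan Branches from newest to oldest, stopping at the first hit. The `disk_reads` counter stands in for random Region reads; all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct rec { uint64_t key; uint64_t value; };

struct branch { const struct rec *recs; size_t len; };

int find(const struct rec *r, size_t n, uint64_t key, uint64_t *out)
{
    for (size_t i = 0; i < n; i++)
        if (r[i].key == key) { *out = r[i].value; return 1; }
    return 0;
}

/* branches[0] is the newest; returns 1 on hit, counts simulated reads */
int node_get(const struct rec *memtable, size_t mem_len,
             const struct branch *branches, size_t n_branches,
             uint64_t key, uint64_t *out, size_t *disk_reads)
{
    *disk_reads = 0;
    if (find(memtable, mem_len, key, out))
        return 1;                   /* memory hit: no disk access */
    for (size_t i = 0; i < n_branches; i++) {
        (*disk_reads)++;            /* one region read per Branch */
        if (find(branches[i].recs, branches[i].len, key, out))
            return 1;               /* newest version found: stop */
    }
    return 0;                       /* worst case: every Branch read */
}
```

This makes the trade-off from the Optimization Game concrete: every extra Branch adds one potential read to the worst-case path of a miss.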

Disk Layout

[struct sdseal;
  [struct sdpageheader;
    [struct sdv];
    [values];
  ]
  struct sdindexheader;
  [struct sdindexpage]; -- for each sdpageheader
  [page min key; page max key;]
  {struct sdindexamqf; amqf blob;}
]

struct PACKED sdseal {
	uint32_t  crc;            /* checksum of this seal */
	struct srversion version;
	uint8_t   flags;
	uint32_t  index_crc;      /* checksum of the index block */
	uint64_t  index_offset;   /* file offset of struct sdindexheader */
};

struct PACKED sdpageheader {
	uint32_t crc;
	uint32_t crcdata;
	uint32_t count;
	uint32_t countdup;
	uint32_t sizeorigin;
	uint32_t sizekeys;
	uint32_t size;
	uint64_t lsnmin;
	uint64_t lsnmindup;
	uint64_t lsnmax;
	uint32_t reserve;
};

struct PACKED sdv {
	uint32_t offset;
	uint8_t  flags;
	uint64_t lsn;
	uint32_t size;
};

struct PACKED sdindexheader {
	uint32_t  crc;
	struct srversion version;
	struct sdid      id;
	uint64_t  offset;
	uint32_t  size;
	uint32_t  sizevmax;
	uint32_t  count;
	uint32_t  keys;
	uint64_t  total;
	uint64_t  totalorigin;
	uint64_t  lsnmin;
	uint64_t  lsnmax;
	uint32_t  dupkeys;
	uint64_t  dupmin;
	uint32_t  extension;
	uint8_t   extensions;
	char      reserve[31];
};

struct PACKED sdindexpage {
	uint64_t offset;
	uint32_t offsetindex;
	uint32_t size;
	uint32_t sizeorigin;
	uint16_t sizemin;
	uint16_t sizemax;
	uint64_t lsnmin;
	uint64_t lsnmax;
};

struct PACKED sdindexamqf {
	uint8_t  q, r;
	uint32_t entries;
	uint32_t size;
	uint64_t table[];
};