Skip to content

SQL: DDL

Nick Zavaritsky edited this page Feb 21, 2017 · 14 revisions

#Problem statement

SQLite utilizes sqlite_master TABLE to maintain a persistent list of tables, indices, views and triggers. In our fork, SQL tables and indices are backed by Tarantool spaces and indices, respectively. Since Tarantool is already capable of tracking its own spaces and indices, sqlite_master looks largely redundant.

Since sqlite_master is to be removed, it's necessary to redesign the schema handling.

In the following sections, we provide information on Tarantool spaces and indices. Then we document the original SQLite schema design. We conclude with the new design proposal.

Spaces and Indices in Tarantool

Tarantool spaces are listed in _space system space. Indices are listed in _index.

_Space Trivia

CREATE TABLE _space (
  id INT PRIMARY KEY,
  owner INT,
  name TEXT,
  engine TEXT,
  field_count INT,
  opts,
  format
)
  • id — spaces are identified by the numeric id;
  • owner — owner id;
  • name — space name;
  • enginevynil or memtx;
  • field_count — number of fields in a tuple, enforced, 0 — variable length tuples;
  • opts — ex: {"temporary": true};
  • format — description of tuple fields, not enforced, ex: [{"name": "column1", "type": "number"}, ...].

_Index Trivia

CREATE TABLE _index (
  id INT,
  iid INT,
  name TEXT,
  type TEXT,
  opts,
  parts,
  PRIMARY KEY (id, iid)
)
  • id — owning space id;
  • iid — numeric id, indices are identified by (id, iid) tuple;
  • name — index name;
  • type — ex: tree;
  • opts — ex: {"unique": true};
  • parts — index key description: column numbers and types, ex: [[0, "scalar"], [1, "scalar"]].

A space may lack indexes. In order for the space to be capable of storing data at least one index is required. An index with iid 0 is the primary index (the name doesn't matter). Creating secondary index before the primary is not permitted. Names and iid-s scope is the owning space. Unlike SQL, two indexes with the same name may coexist provided that they belong to different spaces.

Triggers in Tarantool Spaces

Tarantool takes holistic approach to data management — _space and _index are regular spaces. They are persisted and replicated in exactly the same way regular spaces are. Every space has a list of triggers attached. Triggers fire when a tuple is inserted, removed or updated. When an entry in _space or _index is modified, triggers fire. Consider for instance the following tuple inserted in _space:

[1000, 1, "test", "memtx", 0, {}, []].

The system trigger creates a new space object and initialises it with the provided options. The object is added to internal hash table. Now it is possible to attempt storing and retrieving data from space 1000 (though these attempts will ultimately fail since the space lacks indices yet).

Transactions and _space

There's one burning issue: _space and _index don't support multi-statement transactions. It's only possible to modify these spaces in auto-commit mode (which is a severe limitation we hope to be removed ASAP :)

Schema Handling in Unpatched SQLite

Schema is persisted in sqlite_master table. It's a regular table. It's stored at the fixed location in the database file making it possible to access without the schema.

CREATE TABLE sqlite_master(
  type TEXT,
  name TEXT,
  tbl_name TEXT,
  rootpage INTEGER,
  sql TEXT
)
  • typetable, index, view or trigger;
  • name — entity name;
  • tbl_name — the name of the table the entity belongs to;
  • rootpage — the location of the corresponding BTree in the database file (for tables and indices);
  • sql — SQL statement that created the entity.

Note the sql field — instead of designing a serialisation format to persist in-memory schema objects, SQLite designer decided to simply store SQL statement instead and re-parse it at schema load time. Due to this decision, the parser has two modes, determined by db->init.busy field.

When initialisation is happening, the parser only builds Table-s and Index-es and similar objects. Otherwise, in addition to building objects, a VDBE program is generated to perform the requested operation. Typically, multiple routines are invoked in the course of parsing SQL statement. The db->init.busy checks are interspersed throughout the parser routines.

Consider the stages in processing CREATE TABLE xxx statement.

  1. The text is parsed and VDBE program is generated. Temporary Table and 0+ Index objects are created and used internally by the parser routines. Those temporary objects are ultimately destroyed when the parser completes it's job.
  2. The in-memory schema isn't updated yet, neither is sqlite_master table. Storage for the new table isn't allocated yet.
  3. VDBE program execution starts.
  4. BTree-s are allocated.
  5. sqlite_master is updated.
  6. SELECT name, rootpage, sql FROM sqlite_master WHERE tbl_name=='xxx' and type != 'trigger' fetches back the rows describing the new table.
  7. The same loader that fetched the schema from the database file processes the results and updates the schema. Internally, the parser is invoked again for each row.
  8. The in-memory schema is now updated.
  9. VDBE execution ends.

Bonus Trivia

Multiple entries in sqlite_master may be created by a single CREATE TABLE statement. Consider

CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT UNIQUE)

Not only does the statement create a table, but two indices as well: the first one to implement PRIMARY KEY and the second one to implement UNIQUE constraint. The two implicit indices are named sqlite_autoindex_person_1 and sqlite_autoindex_person_2, respectively. The records describing implicit indices in sqlite_master have an empty sql field.

If the table is WITHOUT ROWID, the data is actually stored in sqlite_autoindex_person_1 index and a BTree for the table isn't created. In the later case, the entry for sqlite_autoindex_person_1 is omitted from sqlite_master since the root page number is stored in the table entry instead.

There's a schema cookie that increments after every schema change. The intention is to notice changes made by other processes (the database file is unlocked when there's no active transactions). When a schema change is detected is is reloaded completely.

New Design Proposal

  1. We expect to integrate SQL into Tarantool more tightly in the future. But for now a certain level of duplication is to be expected: Tarantool in-memory schema will coexist with the SQLite's one.

  2. The only sanctioned method of creating SQLite in-memory schema objects is through the parser. This may change in the future but in the meantime we are being cautious.

  3. We need to store SQL statements in the way reminiscent of the sqlite_master table. Conveniently, both _space and _index have extendable opts dictionary. We add sql there.

  4. In order to populate in-memory SQLite schema, we process _space and _index and synthesise rows in sqlite_master format. We reuse SQLite original schema loader.

  5. For spaces that weren't created by SQL and hence lacks sql statement, we synthesise one.

  6. Tarantool own in-memory schema doesn't need sql strings — extracted and discarded at schema load time.

  7. _Space and _index are added to SQLite in-memory schema using generic Tarantool/SQLite schema sync facility which is to be defined later.

  8. CREATE TABLE creates an entry in _space using generic table modification opcodes.

  9. CREATE INDEX does the same in _index.

  10. CREATE VIEW creates a space without indices (views and tables share the namespace).

  11. CREATE TRIGGER needs a new dedicated space.

  12. After a space or an index is actually created, it must be added to SQLite in-memory schema. Vanilla SQLite performed SELECT from the sqlite_master to fetch rows defining the new entity and invoked the schema loader on them.

  13. There's no sqlite_master anymore. We have two options — either rely on the generic schema sync facility, or generate the row data programmatically. The later is possible since the VDBE program had inserted these rows in the first place. We're going to evaluate both approaches.

Generic Schema Sync Facility

Option A: Schema Cookie

Tarantool already has the mechanism similar to SQLite's schema version cookie. When the schema changes, purge in-memory schema and perform a full reload.

Option B: OnReplace Triggers

Once a space is updated propagate changes into SQLite.

Option C: OnReplace Triggers Scheduling OnCommit Actions

This is similar to B but delays propagation until a transaction commit.

Evaluation

A is not particularly efficient since the in-memory schema is often purged and reloaded from scratch.

B is efficient since only the pointed changes are made to the in-memory schema. Additionally, B allows for a simpler VDBE program, since the in-memory schema syncs automatically once a new space or index is created (see item 13 above).

However, B is notoriously more complex — not only do we need to take action when a new record appears or the old one disappears, we should be capable of executing arbitrary changes and guessing the intent. Ex: ALTER TABLE changes the sql statement. But only limited changes are supported in ALTER TABLE. When sql statement changes we may only drop and reload portions in the in-memory schema that describe the table being changed.

Since transactions are broken C doesn't make sense at all. Additionally, C violates VDBE semantics — once a table is created it is ready to receive data in the same program, but C defers table creation until COMMIT.

Transactions are Broken?

Yes, they are. It's not possible to modify _space or _index in a multi-statement transaction. Hence some operations that ought to be atomic — ex: table creation are performed in multiple sequential transactions:

  1. INSERT INTO _space
  2. COMMIT
  3. For each implicit index
  • INSERT INTO _index
  • COMMIT

I admit that B may create issues when the trigger installed by the schema sync facility propagates changes but the subsequent trigger in the chain ultimately aborts the transaction. End result — ‘ghost’ table in the SQLite in-memory table that may alias other spaces or simply fail any operation attempted.

However, more severe issues are possible due to the lack of atomicity.

Conclusion

We're doing B, approved by @kostja.

Clone this wiki locally