-
Notifications
You must be signed in to change notification settings - Fork 0
SQL: DDL
#Problem statement
SQLite utilizes sqlite_master
TABLE to maintain a persistent list of tables, indices, views and triggers.
In our fork, SQL tables and indices are backed by Tarantool spaces and indices, respectively. Since Tarantool is already capable of tracking its own spaces and indices, sqlite_master
looks largely redundant.
Since sqlite_master
is to be removed, it's necessary to redesign the schema handling.
In the following sections, we provide information on Tarantool spaces and indices. Then we document the original SQLite schema design. We conclude with the new design proposal.
Tarantool spaces are listed in _space
system space. Indices are listed in _index
.
CREATE TABLE _space (
id INT PRIMARY KEY,
owner INT,
name TEXT,
engine TEXT,
field_count INT,
opts,
format
)
-
id
— spaces are identified by the numeric id; -
owner
— owner id; -
name
— space name; -
engine
—vynil
ormemtx
; -
field_count
— number of fields in a tuple, enforced, 0 — variable length tuples; -
opts
— ex:{"temporary": true}
; -
format
— description of tuple fields, not enforced, ex:[{"name": "column1", "type": "number"}, ...]
.
CREATE TABLE _index (
id INT,
iid INT,
name TEXT,
type TEXT,
opts,
parts,
PRIMARY KEY (id, iid)
)
-
id
— owning space id; -
iid
— numeric id, indices are identified by(id, iid)
tuple; -
name
— index name; -
type
— ex:tree
; -
opts
— ex:{"unique": true}
; -
parts
— index key description: column numbers and types, ex:[[0, "scalar"], [1, "scalar"]]
.
A space may lack indexes. In order for the space to be capable of storing data at least one index is required. An index with iid
0 is the primary index (the name doesn't matter). Creating secondary index before the primary is not permitted. Names and iid-s scope is the owning space. Unlike SQL, two indexes with the same name may coexist provided that they belong to different spaces.
Tarantool takes holistic approach to data management — _space
and _index
are regular spaces. They are persisted and replicated in exactly the same way regular spaces are. Every space has a list of triggers attached. Triggers fire when a tuple is inserted, removed or updated. When an entry in _space
or _index
is modified, triggers fire. Consider for instance the following tuple inserted in _space
:
[1000, 1, "test", "memtx", 0, {}, []]
.
The system trigger creates a new space object and initialises it with the provided options. The object is added to internal hash table. Now it is possible to attempt storing and retrieving data from space 1000
(though these attempts will ultimately fail since the space lacks indices yet).
There's one burning issue: _space
and _index
don't support multi-statement transactions. It's only possible to modify these spaces in auto-commit mode (which is a severe limitation we hope to be removed ASAP :)
Schema is persisted in sqlite_master
table. It's a regular table. It's stored at the fixed location in the database file making it possible to access without the schema.
CREATE TABLE sqlite_master(
type TEXT,
name TEXT,
tbl_name TEXT,
rootpage INTEGER,
sql TEXT
)
-
type
—table
,index
,view
ortrigger
; -
name
— entity name; -
tbl_name
— the name of the table the entity belongs to; -
rootpage
— the location of the corresponding BTree in the database file (for tables and indices); -
sql
— SQL statement that created the entity.
Note the sql
field — instead of designing a serialisation format to persist in-memory schema objects, SQLite designer decided to simply store SQL statement instead and re-parse it at schema load time. Due to this decision, the parser has two modes, determined by db->init.busy
field.
When initialisation is happening, the parser only builds Table
-s and Index
-es and similar objects. Otherwise, in addition to building objects, a VDBE program is generated to perform the requested operation. Typically, multiple routines are invoked in the course of parsing SQL statement. The db->init.busy
checks are interspersed throughout the parser routines.
Consider the stages in processing CREATE TABLE xxx
statement.
- The text is parsed and VDBE program is generated. Temporary
Table
and 0+Index
objects are created and used internally by the parser routines. Those temporary objects are ultimately destroyed when the parser completes it's job. - The in-memory schema isn't updated yet, neither is
sqlite_master
table. Storage for the new table isn't allocated yet. - VDBE program execution starts.
- BTree-s are allocated.
-
sqlite_master
is updated. -
SELECT name, rootpage, sql FROM sqlite_master WHERE tbl_name=='xxx' and type != 'trigger'
fetches back the rows describing the new table. - The same loader that fetched the schema from the database file processes the results and updates the schema. Internally, the parser is invoked again for each row.
- The in-memory schema is now updated.
- VDBE execution ends.
Multiple entries in sqlite_master
may be created by a single CREATE TABLE
statement. Consider
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT UNIQUE)
Not only does the statement create a table, but two indices as well: the first one to implement PRIMARY KEY
and the second one to implement UNIQUE
constraint. The two implicit indices are named sqlite_autoindex_person_1
and sqlite_autoindex_person_2
, respectively. The records describing implicit indices in sqlite_master
have an empty sql
field.
If the table is WITHOUT ROWID
, the data is actually stored in sqlite_autoindex_person_1
index and a BTree for the table isn't created. In the later case, the entry for sqlite_autoindex_person_1
is omitted from sqlite_master
since the root page number is stored in the table entry instead.
There's a schema cookie that increments after every schema change. The intention is to notice changes made by other processes (the database file is unlocked when there's no active transactions). When a schema change is detected is is reloaded completely.
-
We expect to integrate SQL into Tarantool more tightly in the future. But for now a certain level of duplication is to be expected: Tarantool in-memory schema will coexist with the SQLite's one.
-
The only sanctioned method of creating SQLite in-memory schema objects is through the parser. This may change in the future but in the meantime we are being cautious.
-
We need to store SQL statements in the way reminiscent of the
sqlite_master
table. Conveniently, both_space
and_index
have extendableopts
dictionary. We addsql
there. -
In order to populate in-memory SQLite schema, we process
_space
and_index
and synthesise rows insqlite_master
format. We reuse SQLite original schema loader. -
For spaces that weren't created by SQL and hence lacks
sql
statement, we synthesise one. -
Tarantool own in-memory schema doesn't need
sql
strings — extracted and discarded at schema load time. -
_Space
and_index
are added to SQLite in-memory schema using generic Tarantool/SQLite schema sync facility which is to be defined later. -
CREATE TABLE
creates an entry in_space
using generic table modification opcodes. -
CREATE INDEX
does the same in_index
. -
CREATE VIEW
creates a space without indices (views and tables share the namespace). -
CREATE TRIGGER
needs a new dedicated space. -
After a space or an index is actually created, it must be added to SQLite in-memory schema. Vanilla SQLite performed
SELECT
from thesqlite_master
to fetch rows defining the new entity and invoked the schema loader on them. -
There's no
sqlite_master
anymore. We have two options — either rely on the generic schema sync facility, or generate the row data programmatically. The later is possible since the VDBE program had inserted these rows in the first place. We're going to evaluate both approaches.
Tarantool already has the mechanism similar to SQLite's schema version cookie. When the schema changes, purge in-memory schema and perform a full reload.
Once a space is updated propagate changes into SQLite.
This is similar to B but delays propagation until a transaction commit.
A is not particularly efficient since the in-memory schema is often purged and reloaded from scratch.
B is efficient since only the pointed changes are made to the in-memory schema. Additionally, B allows for a simpler VDBE program, since the in-memory schema syncs automatically once a new space or index is created (see item 13 above).
However, B is notoriously more complex — not only do we need to take action when a new record appears or the old one disappears, we should be capable of executing arbitrary changes and guessing the intent. Ex: ALTER TABLE
changes the sql statement. But only limited changes are supported in ALTER TABLE
. When sql
statement changes we may only drop and reload portions in the in-memory schema that describe the table being changed.
Since transactions are broken C doesn't make sense at all. Additionally, C violates VDBE semantics — once a table is created it is ready to receive data in the same program, but C defers table creation until COMMIT
.
Yes, they are. It's not possible to modify _space
or _index
in a multi-statement transaction. Hence some operations that ought to be atomic — ex: table creation are performed in multiple sequential transactions:
INSERT INTO _space
COMMIT
- For each implicit index
INSERT INTO _index
COMMIT
I admit that B may create issues when the trigger installed by the schema sync facility propagates changes but the subsequent trigger in the chain ultimately aborts the transaction. End result — ‘ghost’ table in the SQLite in-memory table that may alias other spaces or simply fail any operation attempted.
However, more severe issues are possible due to the lack of atomicity.
We're doing B, approved by @kostja.
Architecture Specifications
- Server architecture
- Feature specifications
- What's in a good specification
- Functional indexes
- Space _index structure
- R tree index quick start and usage
- LuaJIT
- Vinyl
- SQL
- Testing
- Performance
How To ...?
- ... add new fuzzers
- ... build RPM or Deb package using packpack
- ... calculate memory size
- ... debug core dump of stripped tarantool
- ... debug core from different OS
- ... debug Lua state with GDB
- ... generate new bootstrap snapshot
- ... use Address Sanitizer
- ... collect a coredump
Lua modules
Useful links