Skip to content

GSoC 2021 ideas

Timur Safin edited this page Aug 23, 2021 · 1 revision

Google Summer of Code 2021 ideas

Please find below the preliminary list of ideas we would like to explore during this season of Google Summer of Code. Each idea has corresponding difficulty level, required/optional skill sets, and mentor names.

Not all ideas have same amount of detailiness in explanation, but we believe they might be of a good start in interesting direction.

Here is the link to GSoC-2021 profile for Tarantool organization https://summerofcode.withgoogle.com/organizations/4813527870603264/

Table of Contents

Database engines

Add column storage engine

Tarantool storage engines MemTx and Vinyl are row (tuple) based, which is working fine for OLTP activity, but is suboptimal for OLAP queries. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with column based storage. As a bonus, for cluster mode, column storage may store different column families on different nodes for better locality. Such approach may provide (or may be not) better scalability for OLAP queries in a wide-column cluster.

  • We need to experiment with local column storage scenario
  • And wide-column scenario for column families spread over cluster nodes;
  • Measure benefits and overhead;

Difficulty: Hard/Infinite

Required skills: C, C++

Mentor: Kirill Yukhin

Add B+ Tree storage engine

Persistent, LSM-based Tarantool Vinyl storage engine provides good performance for write-only activities, but is suboptimal for read queries, especially if one query cold data. On the other hand B+ tree based storage engines better fit the scenarios when we not only write data (across clusters) but also need to query cold data for simple analytic needs. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with B+ tree based storage.

  • We need to create this engine, plug it in to the generic query flow in Tarantool (making sure that Lua and SQL requests still work for new engine);
  • We need to compare its performance against Vinyl. Measure memory benefits or losses.

Difficulty: Hard/Infinite

Required skills: C, C++

Mentor: Timur Safin, Cyrill Gorkunov

ML-based memtx index optimizations

Machine learning is hot today. There are claims that we could get an optimal balance between memory and performance if we proceed ML woodoo over working set data. This sounds too attractive for us to miss this opportunity and not try ML in Tarantool.

References:

Difficulty: Hard

Required skills: C, C++

Mentor: Nikitta Pettik, Alexander Lyapunov

Vinyl:

Calculate bloom filter and page size automatically:

Vinyl is a disk storage based on LSM tree data structure. To speed up data access (which may take a while for a disk storage) we use probabilistic data structure - bloom filter. It allows us to eliminate excess data lookups - we can skip some files on disk if bloom filter application returns negative result. We should adjust bloom filters and page size automatically to fit a pre-defined page index size. We have the following automatic adjustment strategies, which we should all employ:

  • avoid having bloom filters on lower levels; the deeper the LSM level, the less value per byte used they provide
  • increase page size (again, starting from the bottom-most pages) - the larger is the page size, the smaller is the page index.

Related issues :

Difficulty: Medium

Required skills: C

Mentor: Nikitta Pettik

Implement persistent tuple cache:

To speed up data access in Vinyl we also maintain cache of tuples (to be more precise - chains of tuples) in RAM. However, it is not persistent, i.e. in instance reload its content is erased. It would be nice to dump tuple cache during checkpoint, and recover it during vinyl recovery.

Related issues :

vinyl: persistent tuple cache #2065

Difficulty: Hard

Required skills: C

Mentor: Nikitta Pettik

LuaJIT

Fuzzing Lua (LuaJIT) in Tarantool

LuaJIT is a heart of Tarantool and the stability of it is closely connected to the stability of LuaJIT. But statistics of issues said that we catch bugs with LuaJIT [1] using Tarantool on production. We need to decrease the probability of such bugs and introduce more generative testing for LuaJIT in the Tarantool release cycle.

Some ideas for testing are:

  • Metamorphic testing using existing corpus of Lua scripts.
  • Automated generation of scripts using Lua grammar and introspection of Tarantool embedded functions.
  • Fault injection in memory allocation using embedded error injection engine (see src/lib/core/errinj.c) or using external library with LD_PRELOAD

References

  1. LuaJIT/LuaJIT#568
  2. https://github.com/tarantool/tarantool/issues?q=is%3Aissue+is%3Aopen+label%3Acrash+label%3Aluajit
  3. https://github.com/tarantool/tarantool/issues/4823

Difficulty: Moderate

Required skills: C/C++, Lua

Desired skills: LuaJIT internals, metamorphic testing, fuzzing testing, LibFuzzer

Mentors: Sergey Bronnikov, Igor Munkin

Pluggable JIT engine - experiment with MIR for Lua

LuaJIT is a very decent JIT engine, but with unclear support status and non-reliable release strategy. It might be easier to maintain in the longer term if we switch from LuaJIT to MIR JIT which is being actively developed recently.

We might proceed with an experiment switching from LuaJIT as JIT for Lua code to MIR. Measure JIT overhead, performance of generated code.

Links:

Difficulty: Hard

Required skills: C, C++

Mentor: Timur Safin, Sergey Ostanevich, Igor Munkin

Ecosystem

Implement Tarantool/LuaJIT backend for Compiler Explorer

Code Explorer also known as Godbolt (godbolt.org) became the greatest community tool used by multiple languages community for code share and performance investigations. It started with C++ compilers, but now it includes Ada, Go, Haskell, and many others (and more to come).

Unfortunately, there is no LuaJIT/Tarantool support and furthermore there is no any good substitution for missing godbolt inside of LuaJIT community.

There is a nice LuaJIT web inspector by @mejedi though, which shows bytecode IR and generated x86 assembly for the given Lua script.

  • But it provides only a few configurations for original LuaJIT;
  • And there is no way to generate non-x86 target assembly.

It looks like that godbolt.org is the best infrastructure which should be used here:

  • It has script persistence and short-cutter activated;
  • It has facilities to run any compiler inside of Docker VM;
  • It has builtin facilities to parse compiler assembly and for intercepting and redirecting it to disassembly (or IR) panes inside of Compiler Explorer.

Difficulty: Medium

Required skills: TS/JS, Docker

Optional skills: C/C++, LuaJIT

Mentor: Timur Safin, Igor Munkin

Introduce dynamic trace probes to Tarantool kernel for SystemTap/dtrace

Dynamic tracing probes as introduced to DTrace in Solaris, have been ported to BSD*, Linux and even Windows lately. Also Linux has SystemTap probing facilities which support similar tracing functionality but with less ergonomics.

It's common approach to introduced trace probes to database engines so it will become easier later, at production site to collect data from various system internals.

You should add tracing probes, evaluate overhead, and provide tutorial guidelines for using DTrace/SystemTap for Tarantool performance investigations.

Difficulty: Medium

Required skills: C/C++, Linux kernel

Mentor: Cyrill Gorkunov, Timur Safin

SQL

Migrate Tarantool SQL engine VM to JIT engine

There are multiple virtual machines embedded in Tarantool:

  • LuaJIT with its 2 operating modes (Lua bytecode interpreter, and JIT traces with native code generated);
  • And SQLite-derivative VM code for SQL engine known as vdbe.

Vdbe is not providing best performance and looks alien to the rest of Tarantool code base. Worth to explore JIT engine for SQL:

  • Either we should try to get rid of vdbe byte-code VM and try to reduce the number of VMs used via adopting LuaJIT infrastructure in SQL;

  • or we could explore MIR as JIT infrastructure. MIR is looking promising here, as being under active development at the moment, and as not architected for dynamic types systems.

  • Another opportunity is to investigate is embed LLVM JIT engine. A good reference for this point is PostgreSQL: they introduced experimental option allowing to JIT subset of SQL expressions: machinery is quite similar to ours. Firstly, they compile expression to high-level bytecode, then convert it to llvm-bitcode. We already have some prototypes on this subject. Since the task is sophisticated enough, we can provide them as starting point.

Regardless of an approach selected student will need to proceed these steps:

  • To migrate from vdbe bytecode generation to the selected JIT system;
  • Measure memory and performance differences;
  • Run SQL benchmarks (i.e. TPC-C, TPC-H, etc.) and to verify we not lost performance;

Difficulty: Hard

Required skills: C++, JIT

Optional skills: LuaJIT, SQLite

Mentor: Timur Safin, Nikita Pettik, Igor Munkin

Implement foreign key constraints as NoSQL triggers

Tarantool provides support for a decent part of SQL standard. It is possible to create foreign key constraints on SQL tables. However, SQL queries are executed in a separate environment (query is compiled into bytecode which is executed in an internal virtual machine). This makes foreign keys invisible for NoSQL operations which in turn allow users to break data consistency.

To fix the problem we should implement foreign key constraints as NoSQL replace triggers: once foreign key is created, bytecode implementing this constraint is saved in trigger; corresponding VM is set up and waits in "hot-standby" mode. Such a trigger is fired on any data manipulation of given space (i.e. table in SQL terms). When it is on, bytecode is executed and depending on the result of check - further data manipulation is allowed or not. Similar machinery is already used for CHECK constraints.

Related issues :

Difficulty : Medium+

Required skills : C

Optional skills : SQL

Mentor : Nikita Pettik

Prepared statements enhancements

As many other solid databases, Tarantool supports SQL. Query execution consists of two stages: compilation into high-level bytecode and its execution in a separate virtual machine. Prepared statement is a feature allowing to execute the same (parameterized) statement repeatedly without recompilation overhead.

Still there's some space for improvements and optimizations, such as:

  • don't obsolete prepared statement on certain schema updates;
  • allow to set types of parameters to be bound;
  • copy-on-write VM when the same prepared statement is executed in parallel.

Related issues :

Difficulty : Easy

Required skills : C

Optional skills : SQL

Mentor : Nikita Pettik

Redesign result set processing, making it preemptible

Resulting set is a product of SQL query execution. It consists of two parts: metadata containing column types, names, collations and actually data (i.e. tuples). Accessing a particular field in the tuple can be achieved only by its numeric index. Meanwhile NoSQL functions (e.g. space:select()) give us an ability to access fields by names. So the first idea is to retrieve data in SQL resulting set by column names.

Moreover, execution of the query hangs the database until all rows are processed and then returns all rows at once. This is unacceptable for queries filtering million rows. Instead it is suggested to return resulting tuples by batches (i.e. "portions"). Support of this feature should be also added to IPROTO (Tarantool network protocol) - it is possible using pushes (https://github.com/tarantool/tarantool/blob/master/doc/rfc/3328-wire\_protocol.md).

Related issues :

Difficulty : Medium/Hard

Required skills : C

Mentor : Nikita Pettik

Export/import .sql database

It would be nice to have an opportunity to load SQL database from file as well as generating and dumping SQL schema for any space on disk. This feature is implemented in many other databases, so many Tarantool users want it in action!

Related issues :

Difficulty : Easy/Medium

Required skills : C

Optional skills : SQL

Mentor : Nikita Pettik

Support HASH indexes in SQL

Tarantool NoSQL Memtx engine provides several index types: BPS-Tree, Hash, R-Tree, BitSet. Currently the only one index type supported in SQL is BPS-Tree.

Tree and hash indexes are the most popular ones and have different sets of pros and cons. Unfortunately Tree and Hash indexes feature a bit different low-level API: for example there are only three iterator types for hash indexes (ALL, EQ and GT), meanwhile tree has all seven iterator types. Moreover, there's different time complexity of retrieval operations (hash as a rule is faster than tree, but consumes more memory), so query planner should be aware of it and choose a faster index.

Related issues :

Difficulty : Medium+

Required skills : C

Optional skills : SQL

Mentor : Nikita Pettik

Support UPSERT operation in SQL

Tarantool NoSQL features specific operation - UPSERT. Meanwhile for memtx engine it immediately folds into insert or update operations, for vinyl engine it has other application rules. Namely, upserts are stored as records on disk and applied (squashed) only when the key they are affecting is selected. So upsert operation is vital for Vinyl.

Unfortunately, now there's no way to generate upsert operation using SQL. The suggestion is to introduce an SQL statement (extension for ANSI) which would allow users to generate upsert operations.

For sure, there are several limitations to using pure (i.e. unfolded) upsert operation: absence of foreign key constraints, secondary indexes etc.

In scope of this issue student has to come up with syntax for the new statement; patch our SQL parser to get it to work; patch VM introducing new opcode implementing UPSERT operation; then provide a bytecode generator for the new statement; and finally carefully test new feature.

Related issues :

Difficulty : Medium

Required skills : C

Optional skills : SQL, parser generators

Mentor : Nikita Pettik, Timur Safin

Adapt query planner for disk engine

Originally, the SQL subsystem was borrowed from SQLite. However, SQLite is a disk database using different data structures (LSM disk tree in Tarantool vs disk B-tree in SQLite). Query planner is not optimally tuned for LSM engine, thus queries show suboptimal execution plans. This task allows to dive deeper into the query planner system and adjust it for target engine in order to show better performance.

Related issues :

Difficulty : Hard

Required skills : C

Optional skills : SQL, database internals knowledge

Mentor : Nikita Pettik

Various SQL VM improvements

Beyond introduction of JIT to SQL subsystem there are multiple smaller improvements which might be developed for SQL virtual machine. Choose what you prefer:

  • merge CHECKs bytecode into a monolithic program;
  • add VDBE assembler;
  • use threaded interpretation.

Related issues :

Difficulty : Medium

Required skills : C

Optional skills : SQL, compilers

Mentor : Nikita Pettik, Timur Safin

Allow to create check constraints on space with data in it.

CHECK constraints is a part of SQL which was successfully integrated into NoSQL Tarantool. However, there's still one restriction - user is able to create CHECK constraint only if space hasn't any data in it. We can avoid this limitation using the same mechanism which is invoked we verify uniqueness while building new index. To be more precise: create iterator over the stored data and check each stored tuple for satisfying given constraint.

Related issues :

Difficulty : Easy

Required skills : C

Mentor : Nikita Pettik, Timur Safin

Types: Introduce type MAP/ARRAY to SQL

To finish SQL integration with Tarantool NoSQL, we have to allow users to interact with all existing field types. Otherwise, users are not able to operate on spaces containing at least one of field with type which is unsupported (e.g. see #4758 and #4757). One of those types is MAP, another one - ARRAY. They are not described in ANSI SQL, so we should come up with own rules for interacting with them. To deal with this issue student has to proceed following steps:

  • Patch SQL parser in order to support new types (keywords, syntax constructions);
  • Patch internals of SQL frontend (byte-code generator, query planner) to integrate new types into existing system;
  • Patch SQL runtime VM: come up with idea how to store these types in memory, how to operate on values etc.

Related issues :

Difficulty : Medium

Required skills : C

Optional skills : SQL

Mentor : Nikita Pettik, Timur Safin

Make VIEWs queryable via space:select()

Connectors

Tarantool C++ connector enhancements

Recently we've released a new C++ connector, which is based on compile-time MsgPack (MsgPack is internal format for data stored in Tarantool) encoder/decoder. Reaching the limits of C++, this library is really state of the art, eliminating any possible overheads.

There are many tasks related to it:

  • Port connector to Windows/MacOS (rewrite network engine);
  • Complete providing support of Tarantool features (pushes, SQL);
  • Add wrappers for DDL (so that user can write box.create_space("test"); space["test"].create_index("i"));
  • Implement a set of optimizations.

Difficulty : Medium/Hard

Required skills : C++

Optional skills : compile time state machines, IPC

Mentor : Nikita Pettik

MsgPuck AVX optimizations

There are known JSON parsing implementations which are AVX optimized (e.g. simdjson) and which significantly improve performance of json parsing using simd instructions on cpu. There is belief that msgpack parsing might be done similarly, using simd instructions given proper understanding that could be vectorized in which case. We want to check that this task is solvable and could provide some performance gain.

Difficulty : Medium

Required skills : C, C++, AVX x86

Mentor : Kirill Yukhin

Clone this wiki locally