Relational database schemas for subgraphs #1020

Jannis · 2019-06-27T16:12:26Z

This is required for #297.

lutter · 2019-08-05T00:27:32Z

Goals

We want to switch the storage of entities from JSONB to a relational schema, meaning that each entity type is stored in its own table. The relational schema for a subgraph is generated from the GraphQL schema.

Storage in a relational schema will be added in such a way that already existing subgraphs that use JSONB storage will continue to function. Only newly deployed subgraphs will be stored in a relational schema.

Migrating existing subgraphs to relational storage is not currently planned.

Mapping GraphQL to a relational schema

The mapping from GraphQL to SQL is generated from the GraphQL schema we already store, and needs to be passed into Store.create_subgraph_deployment. The store will need to be able to load the mapping from the database by itself so that it can perform queries against a subgraph. The generated mapping should be cached in the store.

We will continue to store all entities for a subgraph in a subgraph-specific database schema/namespace.

Each type in the GraphQL schema is mapped to a database table, and each attribute of the type is stored in a column of that table. GraphQL types are mapped to SQL types in the following manner:

GraphQL type	SQL type
Boolean	boolean
BigInt	numeric
Bytes	bytea
BigDecimal	numeric
ID	varchar or bytea
Int	integer
String	varchar
List	array

Note that for ID, we generally use varchar, but the block explorer uses bytea for the ID property.

We will only support lists of scalars, i.e., not lists of lists.

Not-null constraints in GraphQL will be mapped to not null constraints in SQL.

Properties that reference other entities, either a single entity, or an array of entities, will be stored as the ID type or as an array of ID.

For example, the GraphQL entity type

type Musician @entity {
    id: ID!
    name: String!
    mainBand: Band
    bands: [Band!]!
    writtenSongs: [Song]! @derivedFrom(field: "writtenBy")
}

would be stored in a SQL table

create table musicians (
        id                   varchar primary key,
        name                 varchar not null,
        main_band            varchar,
        bands                varchar[] not null
     -- derived fields (not stored in this table)
     -- written_songs        varchar[] not null
);

Interfaces

The biggest headache in generating a relational schema for a GraphQL schema are interfaces. The fact that a certain entity implements an interface has no bearing on the relational schema, but GraphQL attributes that reference interfaces need to be treated specially.

Assume we have the GraphQL schema

interface Pet {
    name: String!
}

type Cat implements Pet @entity {
    id: ID!,
    name: String!
}

type Dog implements Pet @entity {
    id: ID!,
    name: String!
}

type Owner @entity {
    id: ID!,
    firstPet: Pet!,
    pets: [Pet!]!
}

The tables for Cat and Dog are straightforward, but the table for Owner needs to deal with a polymorphic reference. This blog post has a nice overview of how to store such references. Since we do not enforce referential integrity in our database schema, we can use the Polymorphic Joins approach from the post, so that the table for Owner would look like

create table owners (
    id             varchar primary key,
    first_pet_type varchar,
    first_pet_id   varchar,
    pets_type      varchar[],
    pets_id        varchar[]
    -- add check constraints to ensure that 
    -- first_pet_type and pets_type only contain
    -- 'Cat' and 'Dog'
)

Serialization

Currently, we serialize entities into JSON objects before storing them in the database. For relational schemas, we will need to convert each attribute of an entity to the corresponding SQL type, and use attribute values directly in queries.

Writing entities

Apart from serialization, writing entities is fairly straightforward, and a Musician entity could be added to the store with a query like

insert into musicians(id, name, main_band, bands)
  values('lennon', 'John Lennon', 'Beatles', '{beatles,quarrymen}')

The above will of course use bind variables in real code. Updating and deleting entities will be handled similarly.

Querying entities

Currently, we only ever query one entity type in the same query; queries against the relational schema therefore only ever reference a single table. The biggest change for querying will be in how we turn an EntityFilter into conditions for a SQL query's where clause, specifically how attributes are accessed in store/postgres/src/filter.rs. With JSONB, the name of a musician is accessed as data->'name'->>'data'. With relational storage, this becomes just name.

Time-travel queries

We will not implement the current scheme of storing entity history in a separate history table; instead, we will follow the approach outlined in this issue

All tables that are generated will have a block_range column added that indicates for which range of blocks an entity version is valid. The id column will no longer be the primary key, instead, we use the exclude constraint shown in the above issue to ensure there is only ever one valid version for an entity at any given block.

The time-travel support will also be used to perform block reversions.

leoyvens · 2019-08-05T17:54:19Z

Plan looks good. I have a comment about interfaces. Currently we do not require mappings to tell us the concrete type of an interface relationship, and instead require all types implementing an interface to not have conflicting ids. So we wouldn't know what to put in first_pet_type, and instead know only first_pet_id and have to join with all implementors of Pet when traversing the relationship.

lutter · 2019-08-05T19:12:48Z

We can create schema that way - it'll cause more work when querying, since we have to look at all Pet entities, but is doable.

Jannis · 2019-08-14T17:47:52Z

@leoyvens One thing we could do there is to perform lookups by ID/IDs to identify the type before writing the references to the db. This requires the referenced entities to already exist before referencing them — which feels like a good constraint.

@lutter A few comments:

The blog post linked for polymorphic joins only demonstrates 1->1 or n->1 relationships but not 1->n relationships like in your example. How does it work in your example?
One thing we're only partially supporting right now is GraphQL's union types, e.g. union Pet = Cat | Dog. Assuming we require Cat and Dog to be concrete entity types, it looks like we can implement them similarly to interfaces: dedicated entity tables for the concrete Cat and Dog types, plus type hints when referencing 1 or more pets. Thinking about it, it shouldn't be hard to support unions involving other unions and interfaces as well, because we can resolve them to all candidate types when we need to.
What I'm missing is an implementation plan (with rough estimates). Could you break the work down a bit, thinking about the sequencing of it?

Other than that, this looks good.

lutter · 2019-08-15T14:49:31Z

@leoyvens One thing we could do there is to perform lookups by ID/IDs to identify the type before writing the references to the db. This requires the referenced entities to already exist before referencing them — which feels like a good constraint.

We made a decision to not enforce referential integrity when writing data at all. According to @yanivtal, we want to make it possible to write an entity that references a Pet before we write the actual Pet entity; and with that, we can't know the type of Pet we are dealing with. References to interfaces would therefore be stored like any other reference, as the ID, and resolution whether it is a Cat or Dog would happen when we run a query.

* The blog post linked for polymorphic joins only demonstrates 1->1 or n->1 relationships but not 1->n relationships like in your example. How does it work in your example?

The relational schema follows the GraphQL schema pretty closely: for 1->n relationships, we'd store an array of ID.

* One thing we're only partially supporting right now is GraphQL's `union` types, e.g. `union Pet = Cat | Dog`. Assuming we require `Cat` and `Dog` to be concrete entity types, it looks like we can implement them similarly to interfaces: dedicated entity tables for the concrete `Cat` and `Dog` types, plus type hints when referencing 1 or more pets. Thinking about it, it shouldn't be hard to support unions involving other unions and interfaces as well, because we can resolve them to all candidate types when we need to.

Yes, I think on the storage side, unions and interfaces are pretty much the same, both for storing references to them, as well as when we query them.

lutter · 2019-09-11T01:43:39Z

Huge chunk of this was merged with PR #1159.

Remaining work:

More testing by indexing existing subgraphs with the relational schema in a separate graph-node instance
Modify relational schema to only index string prefix Modify relational schema to only index string prefix #1137
Relational schema: handle GraphQL enums Relational schema: handle GraphQL enums #1161
Cache storage layout across transactions Cache storage layout across transactions #1177
Properly create HistoryEvent Relational schema: properly create HistoryEvent #1188
Relational schema: add synthetic primary key Relational schema: add synthetic primary key #1164
Do not prematurely convert data to JSONB Relational schema: do not prematurely convert data to JSONB #1207
Reject subgraphs with invalid @derivedFrom fields Reject subgraphs with invalid @derivedFrom fields #1213
Make relational schemas the default for storing subgraphs Make relational schemas the default for storing subgraphs #1246

Jannis · 2019-10-28T14:34:51Z

@lutter I suppose we can close this, right?

lutter · 2019-10-28T15:00:17Z

Yep!

Jannis assigned lutter Jun 27, 2019

Jannis added area/store area/subgraphs component/graph-node performance labels Jun 27, 2019

leoyvens mentioned this issue Jul 11, 2019

IPFS data sources #1017

Closed

lutter mentioned this issue Sep 4, 2019

Relational storage and time-travel query support for entities #1159

Merged

lutter mentioned this issue Sep 23, 2019

Block reversion / time-travel entity storage #1124

Closed

lutter closed this as completed Oct 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relational database schemas for subgraphs #1020

Relational database schemas for subgraphs #1020

Jannis commented Jun 27, 2019 •

edited

Loading

lutter commented Aug 5, 2019 •

edited

Loading

leoyvens commented Aug 5, 2019

lutter commented Aug 5, 2019

Jannis commented Aug 14, 2019 •

edited

Loading

lutter commented Aug 15, 2019

lutter commented Sep 11, 2019 •

edited

Loading

Jannis commented Oct 28, 2019

lutter commented Oct 28, 2019

Relational database schemas for subgraphs #1020

Relational database schemas for subgraphs #1020

Comments

Jannis commented Jun 27, 2019 • edited Loading

lutter commented Aug 5, 2019 • edited Loading

Goals

Mapping GraphQL to a relational schema

Interfaces

Serialization

Writing entities

Querying entities

Time-travel queries

leoyvens commented Aug 5, 2019

lutter commented Aug 5, 2019

Jannis commented Aug 14, 2019 • edited Loading

lutter commented Aug 15, 2019

lutter commented Sep 11, 2019 • edited Loading

Jannis commented Oct 28, 2019

lutter commented Oct 28, 2019

Jannis commented Jun 27, 2019 •

edited

Loading

lutter commented Aug 5, 2019 •

edited

Loading

Jannis commented Aug 14, 2019 •

edited

Loading

lutter commented Sep 11, 2019 •

edited

Loading